Release v0.6.0 · FACTSlab/bead

New subpackage bead.corpus for turning raw text corpora into experimental
Items. CorpusRecord carries text plus flat provenance; CorpusSource is
a streaming-source protocol.
Sources: JsonlCorpusSource (JSON Lines, transparently decompressing
Zstandard .zst files), CsvCorpusSource (CSV/TSV), and
CompletionCorpusSource (a language model as a corpus source, via the new
TextGenerator protocol on the OpenAI and Anthropic adapters).
Lazy pipeline: parse_records, filter_by_structure, sample_corpus, and
record_to_item stream records through a dependency parser and keep only
those whose parse satisfies a structural DSL constraint, producing Items
with standoff parse annotations and source provenance. The pipeline never
loads the full corpus into memory.
New corpus optional-dependency extra (zstandard).

New bead.tokenization.parsers: SpacyParser, StanzaParser, and
create_parser produce a per-sentence ParsedSentence of ParsedToken
records (token, lemma, upos, xpos, head, deprel, morphology, offsets).
parse_to_spans projects a dependency parse onto the standoff Span +
SpanRelation models: one single-token span per token (with its governor as
head_index and its features in span_metadata) and one directed
head-to-dependent relation per syntactic arc.

New transforms in bead.transforms.text: MarkdownStripTransform,
RedditCleanupTransform, and the split_sentences helper (parser-backed or
regex fallback). The first two are registered in the default transform
registry.

New bead.corpus.graph: CorpusGraph, a typed directed multidigraph of
CorpusNodes and CorpusEdges (parallel typed edges allowed; trees are a
special case), with traversal helpers (children, parents, roots,
out_edges, in_edges, subtree, node_by_id).
New bead.corpus.assemble: assemble_graph buffers a record stream into a
CorpusGraph, building edges from declarative EdgeSpecs or a runtime edge
function. Reconstructs thread structure such as Reddit reply trees from
parent_id/link_id. This tier is opt-in and layered on top of the
streaming pipeline, which is untouched.

New subpackage mapping bead data to and from the
layers linguistic-annotation schema
as law-verified didactic lenses (dx.Iso for bijections, dx.Lens with a
complement for projections), so every round-trip is exact and verified.
Faithful mirror models for the layers shared defs and record types, each with
a generic lossless MirrorIso to and from layers-shaped JSON (snake/camel
case, feature maps, slug+uri enums, integer confidence, $type unions).
Bridge lenses map bead-native models onto layers constructs: CorpusRecord
to an expression, CorpusGraph to a property graph (expressions,
graphNodes, and a graphEdgeSet), and a dependency-parsed ParsedSentence
to a tokenization plus part-of-speech and dependency annotationLayers. The
lens complement holds the bead-only remainder (framework identity and fields
layers has no slot for). Resource-overlap lenses map lexical items, lexicons,
and templates to the layers resource constructs.
Mappings are validated against the layers lexicons, vendored as the
vendor/layers git submodule, using the ATProto lexicon validator
(@atproto/lexicon), proving every mapping produces schema-valid layers.

Minimum didactic raised to >=0.7.2 and panproto to >=0.51.0.
Streaming corpus ingestion is now lossless by default: JsonlCorpusSource
and CsvCorpusSource retain every field (not just a configured subset), and
non-scalar values round-trip through JSON rather than being stringified, so
no source information is dropped at ingestion.

Provide feedback