v0.6.0

aaronstevenwhite released this 29 May 17:56
e9e14ef

Added

bead.corpus — streaming corpus ingestion and structural sampling

  • New subpackage bead.corpus for turning raw text corpora into experimental
    Items. CorpusRecord carries text plus flat provenance; CorpusSource is
    a streaming-source protocol.
  • Sources: JsonlCorpusSource (JSON Lines, transparently decompressing
    Zstandard .zst files), CsvCorpusSource (CSV/TSV), and
    CompletionCorpusSource (a language model as a corpus source, via the new
    TextGenerator protocol on the OpenAI and Anthropic adapters).
  • Lazy pipeline: parse_records, filter_by_structure, sample_corpus, and
    record_to_item stream records through a dependency parser and keep only
    those whose parse satisfies a structural DSL constraint, producing Items
    with standoff parse annotations and source provenance. The pipeline never
    loads the full corpus into memory.
  • New optional-dependency extra, corpus, which installs zstandard.
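
The lazy behaviour described above can be illustrated with plain Python generators. This is a minimal standard-library sketch, not bead's API: `read_jsonl`, `keep_if`, and `reservoir_sample` are hypothetical names, and the word-count predicate stands in for a parser-backed structural constraint.

```python
import io
import json
import random
from typing import Callable, Iterable, Iterator


def read_jsonl(stream: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty JSON Lines row, without buffering."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def keep_if(records: Iterable[dict], predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Lazily pass through only the records that satisfy the predicate."""
    return (record for record in records if predicate(record))


def reservoir_sample(records: Iterable[dict], k: int, seed: int = 0) -> list:
    """Uniformly sample up to k records from a stream using O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for index, record in enumerate(records):
        if index < k:
            reservoir.append(record)
        else:
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = record
    return reservoir


raw = io.StringIO(
    '{"text": "The dog chased the cat.", "source": "a"}\n'
    '{"text": "Hello!", "source": "b"}\n'
    '{"text": "She read the book.", "source": "c"}\n'
)
# Stand-in for a structural constraint: keep texts with at least four words.
long_enough = keep_if(read_jsonl(raw), lambda r: len(r["text"].split()) >= 4)
sample = reservoir_sample(long_enough, k=2)
```

Because each stage is a generator, only the reservoir of `k` records is held in memory at any time, regardless of how large the input stream is.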

Dependency parsing in bead.tokenization

  • New bead.tokenization.parsers: SpacyParser, StanzaParser, and
    create_parser produce a per-sentence ParsedSentence of ParsedToken
    records (token, lemma, upos, xpos, head, deprel, morphology, offsets).
  • parse_to_spans projects a dependency parse onto the standoff Span +
    SpanRelation models: one single-token span per token (with its governor as
    head_index and its features in span_metadata) and one directed
    head-to-dependent relation per syntactic arc.
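
The projection described above can be sketched on a toy parse. The classes below (`ToyToken`, `ToySpan`, `ToyRelation`) and the `project` function are hypothetical stand-ins written for illustration; they are not bead's ParsedToken, Span, or SpanRelation models, and the real field names may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyToken:
    form: str
    upos: str
    head: Optional[int]  # index of the governor; None marks the root
    deprel: str


@dataclass(frozen=True)
class ToySpan:
    token_index: int
    head_index: Optional[int]
    metadata: dict


@dataclass(frozen=True)
class ToyRelation:
    source: int  # governor token index
    target: int  # dependent token index
    label: str


def project(tokens):
    """Return one span per token and one head-to-dependent relation per arc."""
    spans = [
        ToySpan(i, tok.head, {"upos": tok.upos, "deprel": tok.deprel})
        for i, tok in enumerate(tokens)
    ]
    relations = [
        ToyRelation(tok.head, i, tok.deprel)
        for i, tok in enumerate(tokens)
        if tok.head is not None  # the root has no incoming arc
    ]
    return spans, relations


sentence = [
    ToyToken("She", "PRON", 1, "nsubj"),
    ToyToken("reads", "VERB", None, "root"),
    ToyToken("books", "NOUN", 1, "obj"),
]
spans, relations = project(sentence)
```

A sentence of n tokens yields n spans and, for a well-formed tree with a single root, n - 1 relations.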

Structural-query builtins in the constraint DSL

  • New bead.dsl standard-library functions query a dependency parse stored on
    an Item: upos, xpos, lemma_of, form_of, deprel, morph, head,
    dependents, has_relation, root, subtree, path_to_root,
    tokens_with_upos, tokens_with_deprel, any_deprel, and filter_upos.
    Constraints can now match syntactic structure, e.g.
    upos(self, root(self)) == "VERB" and len(dependents(self, root(self), "obj")) > 0.
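
To make the example constraint concrete, here is a rough Python re-creation of three of the builtins over a toy parse. The function bodies and signatures are assumptions inferred from the constraint above, not bead's implementation, and the parse is a plain list of dicts rather than an Item.

```python
PARSE = [
    {"form": "She", "upos": "PRON", "head": 1, "deprel": "nsubj"},
    {"form": "reads", "upos": "VERB", "head": None, "deprel": "root"},
    {"form": "books", "upos": "NOUN", "head": 1, "deprel": "obj"},
]


def root(parse):
    """Index of the token that has no governor."""
    return next(i for i, tok in enumerate(parse) if tok["head"] is None)


def upos(parse, index):
    """Universal part-of-speech tag of the token at index."""
    return parse[index]["upos"]


def dependents(parse, index, deprel=None):
    """Indices of tokens governed by index, optionally filtered by relation."""
    return [
        i
        for i, tok in enumerate(parse)
        if tok["head"] == index and (deprel is None or tok["deprel"] == deprel)
    ]


def has_verbal_root_with_object(parse):
    """Python analogue of the example constraint."""
    r = root(parse)
    return upos(parse, r) == "VERB" and len(dependents(parse, r, "obj")) > 0


INTRANSITIVE = [
    {"form": "She", "upos": "PRON", "head": 1, "deprel": "nsubj"},
    {"form": "sleeps", "upos": "VERB", "head": None, "deprel": "root"},
]
```

The constraint holds for `PARSE`, whose verbal root governs an `obj` dependent, and fails for `INTRANSITIVE`, whose root has no object.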

Text transforms for corpus cleanup

  • New transforms in bead.transforms.text: MarkdownStripTransform,
    RedditCleanupTransform, and the split_sentences helper (parser-backed or
    regex fallback). The first two are registered in the default transform
    registry.
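
As an illustration of the kind of cleanup involved, the sketch below strips a few inline Markdown constructs and splits sentences with a regex. Both functions are hypothetical and deliberately simplified; they are not the MarkdownStripTransform or split_sentences implementations, whose exact rules are not given here.

```python
import re


def strip_markdown(text: str) -> str:
    """Remove a few common inline Markdown constructs (not a full parser)."""
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)  # [label](url) -> label
    text = re.sub(r"(\*\*|\*|`)(.+?)\1", r"\2", text)  # bold, italic, inline code
    return text


def split_sentences_regex(text: str) -> list:
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


cleaned = strip_markdown("See **this** [link](http://x.y) and `code`.")
sentences = split_sentences_regex("Hi there. How are you? Fine!")
```

A regex splitter like this one also breaks after abbreviations such as "Dr.", which is one reason to prefer a parser-backed splitter when a parser is available.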

bead.corpus buffering graph tier

  • New bead.corpus.graph: CorpusGraph, a typed directed multidigraph of
    CorpusNodes and CorpusEdges (parallel typed edges allowed; trees are a
    special case), with traversal helpers (children, parents, roots,
    out_edges, in_edges, subtree, node_by_id).
  • New bead.corpus.assemble: assemble_graph buffers a record stream into a
    CorpusGraph, building edges from declarative EdgeSpecs or a runtime edge
    function. It can reconstruct thread structure, such as Reddit reply trees,
    from parent_id/link_id. This tier is opt-in and layered on top of the
    streaming pipeline, which is unchanged.
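
The buffering step can be sketched as follows. This is a hypothetical standard-library illustration, not the assemble_graph API: `assemble_reply_tree` and `subtree` are made-up names, the ids are simplified (no Reddit type prefixes), and only a single parent edge type is modelled.

```python
from collections import defaultdict


def assemble_reply_tree(records):
    """Buffer all records, then link each one to its parent by id."""
    nodes = {record["id"]: record for record in records}
    children = defaultdict(list)
    roots = []
    for record in nodes.values():
        parent = record.get("parent_id")
        if parent in nodes:
            children[parent].append(record["id"])
        else:
            # The parent is absent from the buffer, so treat this as a root.
            roots.append(record["id"])
    return nodes, dict(children), roots


def subtree(children, node_id):
    """Depth-first list of node_id and all of its descendants."""
    result = [node_id]
    for child in children.get(node_id, []):
        result.extend(subtree(children, child))
    return result


records = [
    {"id": "c2", "parent_id": "c1", "text": "reply"},
    {"id": "c1", "parent_id": "post", "text": "top-level comment"},
    {"id": "c3", "parent_id": "c1", "text": "another reply"},
]
nodes, children, roots = assemble_reply_tree(records)
```

Buffering is needed because a reply can arrive before its parent in the stream (as `c2` does here), so edges can only be resolved once every node has been seen.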

bead.interop.layers — lossless layers interop

  • New subpackage mapping bead data to and from the layers
    linguistic-annotation schema as law-verified didactic lenses (dx.Iso for
    bijections, dx.Lens with a complement for projections), so every
    round-trip is exact and verified.
  • Faithful mirror models for the layers shared defs and record types, each with
    a generic lossless MirrorIso to and from layers-shaped JSON (snake/camel
    case, feature maps, slug+uri enums, integer confidence, $type unions).
  • Bridge lenses map bead-native models onto layers constructs: CorpusRecord
    to an expression, CorpusGraph to a property graph (expressions,
    graphNodes, and a graphEdgeSet), and a dependency-parsed ParsedSentence
    to a tokenization plus part-of-speech and dependency annotationLayers. The
    lens complement holds the bead-only remainder (framework identity and fields
    layers has no slot for). Resource-overlap lenses map lexical items, lexicons,
    and templates to the layers resource constructs.
  • Mappings are validated against the layers lexicons, vendored as the
    vendor/layers git submodule, using the ATProto lexicon validator
    (@atproto/lexicon), confirming that every mapping produces schema-valid
    layers output.
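
The two round-trip laws mentioned above can be illustrated without the didactic library. The sketch below is an assumption-laden toy: it models an isomorphism as a pair of key-renaming functions (snake case to camel case and back) and a lens as a `get`/`put` pair whose complement holds the fields the target has no slot for. None of these function names come from bead or didactic, and the renaming is only a bijection for lowercase snake_case keys without digits or repeated underscores.

```python
import re


def snake_to_camel(name: str) -> str:
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def camel_to_snake(name: str) -> str:
    return re.sub(r"(?<!^)([A-Z])", r"_\1", name).lower()


def forward(record: dict) -> dict:
    """Iso direction 1: rename every key to camel case."""
    return {snake_to_camel(key): value for key, value in record.items()}


def backward(record: dict) -> dict:
    """Iso direction 2: rename every key back to snake case."""
    return {camel_to_snake(key): value for key, value in record.items()}


def get(record: dict, shared_keys: set):
    """Lens get: split a record into the shared view and the complement."""
    view = {k: v for k, v in record.items() if k in shared_keys}
    complement = {k: v for k, v in record.items() if k not in shared_keys}
    return view, complement


def put(view: dict, complement: dict) -> dict:
    """Lens put: rebuild the original record from view plus complement."""
    return {**complement, **view}


# Toy record; "internal_id" stands in for a field with no slot in the target.
native = {"head_index": 1, "span_metadata": {"upos": "VERB"}, "internal_id": "abc"}
camel = forward(native)
view, complement = get(native, {"head_index", "span_metadata"})
```

For the isomorphism, `backward(forward(x)) == x` and `forward(backward(y)) == y` must both hold; for the lens, `put(*get(x, keys)) == x` holds because the complement retains whatever the view drops.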

Changed

  • Minimum didactic version raised to >=0.7.2 and minimum panproto version
    raised to >=0.51.0.
  • Streaming corpus ingestion is now lossless by default: JsonlCorpusSource
    and CsvCorpusSource retain every field (not just a configured subset), and
    non-scalar values round-trip through JSON rather than being stringified, so
    no source information is dropped at ingestion.
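
The difference between stringifying and JSON round-tripping a non-scalar value can be seen in a small sketch. The helpers `to_flat` and `from_flat` are hypothetical, and tracking the encoded keys in a separate set is an assumption made for this example; how bead records which fields were encoded is not stated here.

```python
import json

SCALARS = (str, int, float, bool, type(None))


def to_flat(row: dict):
    """Keep scalars as-is; JSON-encode other values and remember their keys."""
    flat, encoded = {}, set()
    for key, value in row.items():
        if isinstance(value, SCALARS):
            flat[key] = value
        else:
            flat[key] = json.dumps(value)
            encoded.add(key)
    return flat, encoded


def from_flat(flat: dict, encoded: set) -> dict:
    """Invert to_flat by decoding only the keys that were JSON-encoded."""
    return {k: json.loads(v) if k in encoded else v for k, v in flat.items()}


row = {"id": 7, "tags": ["a", "b"], "meta": {"score": 0.5}}
flat, encoded = to_flat(row)
restored = from_flat(flat, encoded)

# str() produces a Python repr, which is not valid JSON for this list.
stringified_tags = str(row["tags"])
```

Here `restored` equals the original row, whereas `stringified_tags` uses single quotes and cannot be parsed back with `json.loads`.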