‘Chunking’ abstraction #85

Treora · 2020-08-17T11:28:18Z

In recent calls we (especially @tilgovi, so feel free to improve my description) have discussed an approach that lets our text selector matching/describing implementations work on ‘document models’ other than the DOM. A typical use case would be a (web) application that uses some framework (ProseMirror, React, …) to display documents, and therefore would not want the result of anchoring an annotation to be a Range object, but rather something that matches its internal representation of the document.

A discussed requirement is also that the document can be provided piecemeal and asynchronously, so that an application can try to anchor selectors on documents that are not fully available yet (or just not fully converted to text yet; think e.g. PDF.js). We have been calling such pieces of text ‘chunks’ for now.

Currently, our text quote anchoring function (in the dom package) is hard-coded to search for a text quote using Range, NodeIterator and TreeWalker. When using the chunk approach, this functionality should be composed of two parts: one generic text quote anchoring function that takes a stream of Chunks of text; and one dom-to-chunk converter that uses TreeWalkers and such to present the DOM as a stream of text Chunks.

I am creating this issue to discuss what exactly a Chunk would be (a string?), and what a stream of chunks would be (an AsyncIterable<Chunk>?), and how our generalised anchoring functions interact with chunk providers (e.g. do we need an equivalent of Range, how do we pass back string offsets, …?). And also to discuss the assumptions and requirements (are we on the right track?).

Treora · 2020-08-17T11:54:07Z

@tilgovi alluded to this chunking idea in #75:

However, I think we should consider going a step further and writing a text selector that consumes an iterator that yields chunks rather than receiving a full text with the initial call. This interface would be useful for streaming scenarios where the whole text may not be available or may be extremely large. The chunks themselves could be arrays or strings, and if we decide that they are strings we may wish to iterate over their code points.

Just one idea this raised: would it go too far to just ask for a stream of characters, that we permit to be nested? A simple experiment:

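// A 'character' is a string consisting of exactly one code point.
function isCharacter(charOrStream) {
  return typeof charOrStream === 'string' && [...charOrStream].length === 1;
}

// Walk a (possibly nested) iterable and print each character on its own line.
function printCharacters(iterable) {
  for (const charOrStream of iterable) {
    if (!isCharacter(charOrStream)) {
      // Anything that is not a single character is treated as a nested stream.
      printCharacters(charOrStream);
    } else {
      console.log(charOrStream);
    }
  }
}

printCharacters('bla. ');
printCharacters(['two', ' chunks. ']);
printCharacters(['chunks', [' could', ' nest.', '']]);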

I suppose such an approach may be elegant for a specification or reference implementation, but perhaps requiring a stricter structure could help performance (iterating through a string’s individual characters might be required anyway when anchoring a TextPositionSelector, but for a TextQuoteSelector it could be superfluous?).
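
One possible middle ground, sketched here only as an illustration: accept nested input at the edge, but flatten it into plain string chunks so that a matcher can work on whole strings instead of single characters. The names `NestedChunks` and `flattenChunks` are hypothetical and do not come from the codebase.

```typescript
// Hypothetical type: a string chunk, or an iterable of further nested chunks.
type NestedChunks = string | Iterable<NestedChunks>;

// Yield the non-empty string chunks of a nested structure, in document order.
// Strings are passed through whole; they are never split into characters.
function* flattenChunks(input: NestedChunks): Generator<string> {
  if (typeof input === 'string') {
    if (input.length > 0) yield input;
    return;
  }
  for (const item of input) {
    yield* flattenChunks(item);
  }
}

// Usage: the nested example from above becomes a flat list of three strings.
const flattened = Array.from(flattenChunks(['chunks', [' could', ' nest.', '']]));
console.log(flattened);
```

Empty strings are dropped in this sketch; whether empty chunks should survive is a design choice that depends on whether callers need to map every chunk back to a position in their own document model.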

Treora · 2020-09-16T18:03:25Z

Currently, our text quote anchoring function (in the dom package) is hard-coded to search for a text quote using Range, NodeIterator and TreeWalker. When using the chunk approach, this functionality should be composed of two parts: one generic text quote anchoring function that takes a stream of Chunks of text; and one dom-to-chunk converter that uses TreeWalkers and such to present the DOM as a stream of text Chunks.

I started playing with this idea in the branch chunking.

In this first attempt a Chunk is anything that has a toString() method:

export interface Chunk {
  toString(): string;
}

And we can point at a part of the text using a straightforward generalisation of Range:

export interface ChunkRange<TChunk extends Chunk> {
  startChunk: TChunk;
  startIndex: number;
  endChunk: TChunk;
  endIndex: number;
}
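
To illustrate how such a range points into the text, here is a minimal sketch that reads back the text a ChunkRange covers. It assumes the chunks are available as an ordered array and are compared by identity; `TextChunk` and `textOfRange` are hypothetical names used only for this example.

```typescript
interface Chunk {
  toString(): string;
}

interface ChunkRange<TChunk extends Chunk> {
  startChunk: TChunk;
  startIndex: number;
  endChunk: TChunk;
  endIndex: number;
}

// Hypothetical chunk type: an object wrapping a piece of text.
class TextChunk implements Chunk {
  private readonly text: string;
  constructor(text: string) {
    this.text = text;
  }
  toString(): string {
    return this.text;
  }
}

// Return the text covered by `range`, given all chunks in document order.
function textOfRange<TChunk extends Chunk>(
  chunks: TChunk[],
  range: ChunkRange<TChunk>,
): string {
  const first = chunks.indexOf(range.startChunk);
  const last = chunks.indexOf(range.endChunk);
  if (first === -1 || last === -1 || last < first) {
    throw new Error('Range does not fit the given chunks');
  }
  if (first === last) {
    return chunks[first].toString().slice(range.startIndex, range.endIndex);
  }
  // Tail of the first chunk, all chunks in between, head of the last chunk.
  const parts = [chunks[first].toString().slice(range.startIndex)];
  for (let i = first + 1; i < last; i++) {
    parts.push(chunks[i].toString());
  }
  parts.push(chunks[last].toString().slice(0, range.endIndex));
  return parts.join('');
}
```

As with a DOM Range, the end index is exclusive here; that is an assumption of the sketch rather than something the interface itself states.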

I made an abstracted version of the text quote matcher that accepts as its scope an AsyncIterable<TChunk>:

export function abstractTextQuoteSelectorMatcher(
  selector: TextQuoteSelector,
): <TChunk extends Chunk>(textChunks: AsyncIterable<TChunk>) => AsyncIterable<ChunkRange<TChunk>> {

So one can throw any type of TChunk in, and get ranges using the same type back. For the concrete implementation, I actually used Ranges as the TChunk type, with each Range wrapping a single text node. (We can’t just throw in text nodes themselves, both because their toString() method does not return the text content, and because the first and the last node might be only partially part of the scope.)
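
For readers who want a feel for how a matcher of this shape could work, below is a simplified sketch. It is not the code from the chunking branch. It only looks at the selector’s `exact` field (prefix and suffix are ignored), uses a naive search, and reports overlapping matches too; `sketchTextQuoteMatcher` and `Candidate` are hypothetical names.

```typescript
interface Chunk {
  toString(): string;
}

interface ChunkRange<TChunk extends Chunk> {
  startChunk: TChunk;
  startIndex: number;
  endChunk: TChunk;
  endIndex: number;
}

// Minimal stand-in for the selector type; only `exact` is used in this sketch.
interface TextQuoteSelector {
  exact: string;
  prefix?: string;
  suffix?: string;
}

// A match that has started but whose end has not been seen yet.
interface Candidate<TChunk extends Chunk> {
  startChunk: TChunk;
  startIndex: number;
  matched: number; // number of characters of `exact` matched so far
}

function sketchTextQuoteMatcher(selector: TextQuoteSelector) {
  const exact = selector.exact;
  return async function* match<TChunk extends Chunk>(
    textChunks: AsyncIterable<TChunk>,
  ): AsyncGenerator<ChunkRange<TChunk>> {
    if (exact.length === 0) return; // empty quotes are skipped in this sketch
    let candidates: Candidate<TChunk>[] = [];
    for await (const chunk of textChunks) {
      const text = chunk.toString();
      const next: Candidate<TChunk>[] = [];
      // 1. Try to continue matches that started in an earlier chunk.
      for (const c of candidates) {
        const rest = exact.slice(c.matched);
        if (text.length >= rest.length) {
          if (text.startsWith(rest)) {
            yield {
              startChunk: c.startChunk,
              startIndex: c.startIndex,
              endChunk: chunk,
              endIndex: rest.length,
            };
          }
        } else if (rest.startsWith(text)) {
          next.push({ ...c, matched: c.matched + text.length });
        }
      }
      // 2. Look for matches that start inside this chunk.
      for (let i = 0; i < text.length; i++) {
        const available = text.length - i;
        if (available >= exact.length) {
          if (text.startsWith(exact, i)) {
            yield {
              startChunk: chunk,
              startIndex: i,
              endChunk: chunk,
              endIndex: i + exact.length,
            };
          }
        } else if (exact.startsWith(text.slice(i))) {
          // The chunk ends before the quote does; remember the partial match.
          next.push({ startChunk: chunk, startIndex: i, matched: available });
        }
      }
      candidates = next;
    }
  };
}
```

Because the sketch never looks ahead, it can emit a match as soon as the chunk containing its end arrives, which fits the requirement that documents may be provided piecemeal. A production version would likely replace the naive scan with a proper string search algorithm.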

For the text quote matching this works fine (all tests pass). For describing a text quote, however, it would be helpful to have more freedom to navigate the text, instead of only walking through it in a single pass (especially to find prefixes). I suppose our scope should have an API that is more like a TreeWalker than like a NodeIterator: jump to any spot, and then walk in either direction.
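
As a rough illustration of what a TreeWalker-like scope might look like, here is a synchronous sketch with an array-backed implementation and a helper that collects the text preceding a position, as one would need to find a prefix. Everything here (`ChunkWalker`, `ArrayChunkWalker`, `seekTo`, `textBefore`) is a hypothetical name, not an existing API.

```typescript
interface Chunk {
  toString(): string;
}

// Hypothetical interface, loosely modelled on the DOM's TreeWalker.
interface ChunkWalker<TChunk extends Chunk> {
  readonly currentChunk: TChunk;
  nextChunk(): TChunk | null; // move forward; returns null (without moving) at the end
  previousChunk(): TChunk | null; // move backward; returns null (without moving) at the start
  seekTo(chunk: TChunk): void; // jump to a known chunk
}

class ArrayChunkWalker<TChunk extends Chunk> implements ChunkWalker<TChunk> {
  private readonly chunks: TChunk[];
  private position = 0;

  constructor(chunks: TChunk[]) {
    if (chunks.length === 0) throw new Error('Need at least one chunk');
    this.chunks = chunks;
  }

  get currentChunk(): TChunk {
    return this.chunks[this.position];
  }

  nextChunk(): TChunk | null {
    if (this.position + 1 >= this.chunks.length) return null;
    this.position += 1;
    return this.currentChunk;
  }

  previousChunk(): TChunk | null {
    if (this.position === 0) return null;
    this.position -= 1;
    return this.currentChunk;
  }

  seekTo(chunk: TChunk): void {
    const index = this.chunks.indexOf(chunk);
    if (index === -1) throw new Error('Unknown chunk');
    this.position = index;
  }
}

// Collect up to `length` characters of the text that precedes (chunk, index).
function textBefore<TChunk extends Chunk>(
  walker: ChunkWalker<TChunk>,
  chunk: TChunk,
  index: number,
  length: number,
): string {
  walker.seekTo(chunk);
  let text = chunk.toString().slice(0, index);
  while (text.length < length) {
    const previous = walker.previousChunk();
    if (previous === null) break; // reached the start of the scope
    text = previous.toString() + text;
  }
  return text.slice(Math.max(0, text.length - length));
}
```

An asynchronous variant, where `nextChunk()` and `previousChunk()` return promises, would be needed to keep supporting documents that arrive piecemeal; the synchronous form is used here only to keep the sketch short.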

@tilgovi: Any thoughts about the approach to try, before I go further down this rabbit hole?

Treora added the ‘discussion’ label (Issues without a clear plan for action) Aug 21, 2020

tilgovi assigned Treora Nov 5, 2020

Treora mentioned this issue Nov 20, 2020

Implement TextPositionSelector, create Chunking abstraction #98

Merged

Treora closed this as completed in 6ecfaa2 Dec 24, 2020
