Support TextPositionSelector (in the dom package) #75

Treora · 2020-05-03T16:54:23Z

Although it looks simple, there may be challenges in ensuring we count characters correctly. From the spec (in the TextQuoteSelector section, which the TextPositionSelector section then refers to):

The selection of the text MUST be in terms of unicode code points (the "character number"), not in terms of code units (that number expressed using a selected data type). Selections SHOULD NOT start or end in the middle of a grapheme cluster. The selection MUST be based on the logical order of the text, rather than the visual order, especially for bidirectional text. For more information about the character model of text used on the web, see charmod.

The text MUST be normalized before recording in the Annotation. Thus HTML/XML tags SHOULD be removed, and character entities SHOULD be replaced with the character that they encode.

The referenced ‘charmod’ (Character Model for the WWW) has a section on string indexing that may be relevant.
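To make the grapheme cluster recommendation concrete, here is a minimal sketch of checking whether a code point offset falls on a cluster boundary. It assumes a runtime that provides `Intl.Segmenter`. The names `graphemeBoundaries` and `isClusterSafe` are hypothetical, made up for illustration, and are not part of any existing package.

```javascript
// Hypothetical helper: collect the code point offsets at which a
// grapheme cluster starts or ends. Assumes Intl.Segmenter is available.
function graphemeBoundaries(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  const boundaries = new Set([0]);
  let offset = 0;
  for (const { segment } of segmenter.segment(text)) {
    // Spreading a string yields code points, so this counts code points,
    // not UTF-16 code units.
    offset += [...segment].length;
    boundaries.add(offset);
  }
  return boundaries;
}

// Hypothetical helper: true if neither end of the selection splits a cluster.
function isClusterSafe(text, { start, end }) {
  const boundaries = graphemeBoundaries(text);
  return boundaries.has(start) && boundaries.has(end);
}

// 'e' followed by U+0301 (combining acute accent) forms a single cluster.
const word = 'cafe\u0301!';
console.log(isClusterSafe(word, { start: 0, end: 5 })); // true
console.log(isClusterSafe(word, { start: 0, end: 4 })); // false
```

Since the spec only says SHOULD NOT here, a check like this would probably serve as a warning when creating selectors rather than as a reason to reject existing ones.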

What still confuses me a little is what constitutes the exact text of a DOM. Given that normalisation should (why not must?) remove HTML tags, I suppose this assumes we deal with the source HTML.

What then to do with comments: are those text, or are only their delimiters (<!-- and -->) to be removed? In the latter case, would the document’s total text equal the textContent of all children of the Document? (One may think of document.documentElement.textContent, but that excludes whitespace and comments outside the <html> element.)

Possibly more problematic, can one even access the source HTML accurately enough through the DOM? Might a source parser have modified whitespace, thus leading to miscounts? I am not even talking about executed scripts that may modify the DOM too; I suppose we have to disregard that scenario.

Of course there are implementations already whose approach and behaviour we could copy, but it may be good to do the exercise of implementing based on the spec to ensure that it matches up, also to help detect discrepancies between implementations and spots where the spec may need to be improved/updated.

Any differences in implementations would likely result in misanchored annotations, so doing this imprecisely seems of little value; unless the use is explicitly limited to only apply to e.g. selector refinement within text nodes, which could be a strategy to take.

@tilgovi (or others): what are your thoughts about this, and about the implementation as it is done in dom-anchor-text-position, in Hypothesis, or elsewhere?

tilgovi · 2020-08-16T22:40:23Z

The text MUST be normalized before recording the annotation. Thus HTML/XML tags SHOULD be removed and character entities SHOULD be replaced with the character that they encode.

As long as we stick to textContent or innerText, we are covered here. The tags are not part of this and entities are already replaced.

Possibly more problematic, can one even access the source html accurately enough through the DOM? Might a source parser have modified whitespace, thus leading to miscounts? I am not even talking about executed scripts that may modify the DOM too, I suppose we have to disregard that scenario.

We should be fine, at least for text nodes. The CSS white space properties make it important that parsers preserve the text nodes as is.

I think the spec is still somewhat vague and open to interpretation. I am partial to using innerText because it's the closest to the actual presentation. Regardless of what we choose, we have some work that supports any decision and helps us handle characters with multiple code units.

Iterating over a string in JavaScript yields strings that each represent one code point (so each iteration may yield a string with more than one code unit). As a result, one can also do [...string] and get an array of the code points. If we write a generic text position selector in terms of iteration over code points, then we can compose it with anything that generates such an iterator from any other source, like generating innerText from a DOM Node or Range. The simplest thing is to call it with [...string].
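As a rough illustration of the idea, the sketch below counts and selects by code point rather than by UTF-16 code unit. The function names `codePointLength` and `selectByPosition` are hypothetical and chosen only for this example; they are not the API of any existing package.

```javascript
// Hypothetical helper: count code points, not UTF-16 code units.
function codePointLength(text) {
  let count = 0;
  for (const _codePoint of text) count++;
  return count;
}

// Hypothetical sketch: resolve a { start, end } position against any
// iterable that yields one code point per step. A plain string is such
// an iterable, and so is the array produced by [...string].
function selectByPosition(codePoints, { start, end }) {
  let index = 0;
  let selected = '';
  for (const codePoint of codePoints) {
    if (index >= end) break;
    if (index >= start) selected += codePoint;
    index++;
  }
  return selected;
}

const sample = 'a😀b';
console.log(sample.length); // 4 (code units)
console.log(codePointLength(sample)); // 3 (code points)
console.log(selectByPosition(sample, { start: 1, end: 2 })); // '😀'
console.log(selectByPosition([...sample], { start: 1, end: 3 })); // '😀b'
```

Note that `sample.slice(1, 2)` would instead return a lone surrogate, which is the kind of miscount the spec's code point requirement is meant to prevent.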

However, I think we should consider going a step further and writing a text selector that consumes an iterator that yields chunks rather than receiving a full text with the initial call. This interface would be useful for streaming scenarios where the whole text may not be available or may be extremely large. The chunks themselves could be arrays or strings, and if we decide that they are strings we may wish to iterate over their code points.
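A minimal sketch of what such a chunk-consuming selector could look like follows. It assumes that chunks are strings and that no chunk boundary splits a surrogate pair; both are assumptions of this example, not decisions made in this thread. The names `selectFromChunks` and `exampleChunks` are hypothetical.

```javascript
// Hypothetical sketch: resolve a { start, end } position (in code points)
// against an iterable of string chunks, pulling chunks only as needed.
// Assumes no chunk boundary falls inside a surrogate pair.
function selectFromChunks(chunks, { start, end }) {
  let index = 0; // code points consumed so far, across all chunks
  let selected = '';
  for (const chunk of chunks) {
    for (const codePoint of chunk) {
      if (index >= start && index < end) selected += codePoint;
      index++;
      // Stop as soon as the end is reached, so later chunks are never requested.
      if (index >= end) return selected;
    }
  }
  return selected; // the text ended before reaching `end`
}

// Example source that records which chunks were actually pulled.
function* exampleChunks(log) {
  for (const chunk of ['He said 😀', ' and ', 'left.']) {
    log.push(chunk);
    yield chunk;
  }
}

const pulled = [];
const selected = selectFromChunks(exampleChunks(pulled), { start: 8, end: 13 });
console.log(selected); // '😀 and'
console.log(pulled.length); // 2 (the third chunk was never requested)
```

Because the selector only asks for the next chunk when it needs it, the same function would work whether the chunks come from an array, from text nodes visited lazily, or from a stream.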

Treora · 2020-10-08T21:31:04Z

I started implementing this in the text-position branch. I started by simply copying the code and tests from text quote selection, and adapting them as needed.

Implement TextPositionSelector, create Chunking abstraction. Fixes #85, #75.

Treora · 2020-12-24T17:41:16Z

Implemented in #98.

Treora mentioned this issue May 3, 2020

Upgrade to Apache Annotator BigBlueHat/page-notes#18

Open

Treora mentioned this issue Jul 16, 2020

Import @tilgovi's dom-anchor-* libraries #38

Closed

Treora added the let’s code Things waiting to be implemented label Jul 16, 2020

Treora mentioned this issue Aug 17, 2020

‘Chunking’ abstraction #85

Closed

Treora self-assigned this Oct 8, 2020

Treora mentioned this issue Nov 20, 2020

Implement TextPositionSelector, create Chunking abstraction #98

Merged

Treora added a commit that referenced this issue Dec 24, 2020

Merge pull request #98 from apache/text-position

6ecfaa2

Implement TextPositionSelector, create Chunking abstraction. Fixes #85, #75.

Treora closed this as completed Dec 24, 2020
