Specific CFI Problems via the XPATH reference #470

MaxDaten · 2016-10-04T23:44:03Z

Hi,

I'm currently debugging a problem with a specific CFI generated by me.

First the relevant HTML section (not the complete HTML document!)

      <h2 id="h2-5" title="Kapitel 5"><a id="page_39"></a><span class="small">01100010110010</span> <b>5.</b> <span class="small">10010111010100</span></h2>

And now two CFIs. The first one successfully references the h2 tag in the document. The second one, which should point at the whitespace before the opening b tag, does not resolve. The corresponding XPath is evaluated here:

epub.js/src/epubcfi.js

Line 511 in d8ba903

    
           startContainer = doc.evaluate(xpath, doc, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;

Working:
- CFI: "epubcfi(/6/18!/4/352[h2-5]"
- XPath: "./*/*[2]/*[position()=176 and @id='h2-5']/*[1]"

NOT working:
- CFI: "epubcfi(/6/18!/4/352[h2-5]/5:0)"
- XPath: "./*/*[2]/*[position()=176 and @id='h2-5']/text()[2]"

I narrowed the problem down to a mismatch between the CFI and XPath specs. In CFI, empty character data chunks are still indexed:

Consecutive (_potentially-empty_) chunks of character data are each assigned odd indices (i.e., starting at 1, followed by 3, etc.).

http://www.idpf.org/epub/linking/cfi/epub-cfi.html#sec-path-child-ref

In XPath, only character data with at least one character is grouped into a text node:

Character data is grouped into text nodes. As much character data as possible is grouped into each text node: a text node never has an immediately following or preceding sibling that is a text node. The string-value of a text node is the character data. _A text node always has at least one character of data._

https://www.w3.org/TR/xpath/#section-Text-Nodes
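To make the mismatch concrete, here is a minimal sketch of how an odd CFI step could be mapped to an XPath text() position while skipping empty chunks. The function name cfiTextStepToXPathPosition and the simplified child list (an array of "element" / "text" strings, with adjacent text already merged) are assumptions for illustration and are not part of epub.js.

```javascript
// Hypothetical helper, not part of epub.js.
// children: child node kinds in document order, e.g. ["element", "text", ...].
// Assumes adjacent text is already merged and every "text" entry is non-empty,
// which is how XPath sees the tree.
// Returns the 1-based XPath text() position for the CFI text step,
// or null when the CFI chunk is empty (no text node exists for it).
function cfiTextStepToXPathPosition(children, cfiStep) {
  if (!Number.isInteger(cfiStep) || cfiStep < 1 || cfiStep % 2 === 0) {
    throw new RangeError("CFI text steps are odd positive integers");
  }
  // CFI chunk 2k+1 sits directly after the k-th child element.
  const elementsBefore = (cfiStep - 1) / 2;
  let elementsSeen = 0;
  let textNodesSeen = 0;
  for (const kind of children) {
    if (kind === "element") {
      // The next element arrived before any text: the chunk is empty.
      if (elementsSeen === elementsBefore) return null;
      elementsSeen += 1;
    } else {
      textNodesSeen += 1;
      if (elementsSeen === elementsBefore) return textNodesSeen;
    }
  }
  return null; // empty trailing chunk, or step out of range
}

// Children of the h2 above: <a>, <span>, " ", <b>, " ", <span>
const h2Children = ["element", "element", "text", "element", "text", "element"];
const position = cfiTextStepToXPathPosition(h2Children, 5);
```

For the h2 in this issue, step 5 maps to the first text node, because CFI chunks 1 and 3 are empty and have no XPath counterpart. A plain (step - 1) / 2 style calculation would give 2 instead, which matches the text()[2] seen in the failing XPath.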

As a result, the XPath generated by generateXpathFromSteps can reference text nodes that cannot exist in the XPath data model:

epub.js/src/epubcfi.js

Line 448 in d8ba903

EPUBJS.EpubCFI.prototype.generateXpathFromSteps = function(steps) {

Maybe a fallback to the parent node is a possible solution?


fchasen · 2016-10-28T16:01:49Z

Sorry this slipped me by, thanks so much for writing this up.

I'll need to look into this a bit more, but I agree that a good approach for now is fallback to the parent.
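The parent fallback discussed here could look roughly like the sketch below. It is not the epub.js implementation: resolveWithParentFallback is a hypothetical helper, and the evaluate callback is assumed to wrap the doc.evaluate(...).singleNodeValue call quoted earlier in this issue.

```javascript
// Hypothetical helper, not part of epub.js.
// evaluate: function taking an XPath string and returning a node or null.
// In a browser it could be:
//   (xp) => doc.evaluate(xp, doc, null,
//     XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue
function resolveWithParentFallback(xpath, evaluate) {
  const node = evaluate(xpath);
  if (node) return { node: node, fellBack: false };

  // Drop a trailing "/text()[n]" step so the path points at the parent element.
  const parentXpath = xpath.replace(/\/text\(\)\[\d+\]$/, "");
  if (parentXpath === xpath) {
    // No text step to strip, so there is nothing to fall back to.
    return { node: null, fellBack: false };
  }
  return { node: evaluate(parentXpath), fellBack: true };
}
```

The fellBack flag lets the caller know that any character offset from the CFI (the :0 part) no longer applies to the returned node.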

khcpietro · 2021-08-23T07:05:57Z

Is there any progress here? @MaxDaten pointed out the cause exactly: CFI and XPath treat empty chunks differently.

khcpietro · 2021-09-15T02:49:29Z

Here is my workaround.
You can insert a space between adjacent tags after the HTML content has loaded.

document.body.innerHTML = document.body.innerHTML.replaceAll('><', '> <')

Sadly, this changes the EPUB content itself and can cause performance issues. However, if you care more about consistency between CFI and XPath, you can try it.

fchasen added the Bug label Oct 28, 2016

johnfactotum mentioned this issue Sep 22, 2023

Readium cfi and epubjs cfi #1358

Open
