Specific CFI Problems via the XPATH reference #470

Open
MaxDaten opened this issue Oct 4, 2016 · 3 comments
MaxDaten commented Oct 4, 2016

Hi,

I'm currently debugging a problem with a specific CFI generated by me.

First the relevant HTML section (not the complete HTML document!)

      <h2 id="h2-5" title="Kapitel 5"><a id="page_39"></a><span class="small">01100010110010</span> <b>5.</b> <span class="small">10010111010100</span></h2>

And now two CFIs. With the first CFI I can successfully reference the h2 tag in the document, but with the second I cannot reference the first whitespace before the opening b tag. The corresponding XPath is evaluated like this:

startContainer = doc.evaluate(xpath, doc, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;

  • Working
    • CFI: "epubcfi(/6/18!/4/352[h2-5])"
    • XPath: "./*/*[2]/*[position()=176 and @id='h2-5']/*[1]"
  • NOT working
    • CFI: "epubcfi(/6/18!/4/352[h2-5]/5:0)"
    • XPath: "./*/*[2]/*[position()=176 and @id='h2-5']/text()[2]"

I narrowed the problem down to a mismatch between the CFI and XPath specs. In CFI, empty character data chunks are still indexed:

Consecutive (_potentially-empty_) chunks of character data are each assigned odd indices (i.e., starting at 1, followed by 3, etc.).

http://www.idpf.org/epub/linking/cfi/epub-cfi.html#sec-path-child-ref

In XPath, only character data with at least one character is grouped into a text node:

Character data is grouped into text nodes. As much character data as possible is grouped into each text node: a text node never has an immediately following or preceding sibling that is a text node. The string-value of a text node is the character data. _A text node always has at least one character of data._

https://www.w3.org/TR/xpath/#section-Text-Nodes
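To make the mismatch concrete, here is a minimal sketch that numbers the children of the h2 above both ways. The array h2Children and the function mapCfiChunksToXPathText are hypothetical names used only for illustration; they are not part of epub.js, and the children are modelled as plain objects rather than real DOM nodes.

```javascript
// Hypothetical, simplified model of the <h2> children from the example above.
// Each entry is either an element ({ el: name }) or character data ({ text: string }).
const h2Children = [
  { el: 'a' },
  { el: 'span' },
  { text: ' ' },
  { el: 'b' },
  { text: ' ' },
  { el: 'span' },
];

// Returns a Map from each odd CFI index to the 1-based XPath text() position
// of the same chunk, or null when the chunk is empty and has no text node.
function mapCfiChunksToXPathText(children) {
  const mapping = new Map();
  let cfiIndex = 1;      // CFI: chunks get odd indices, elements even ones
  let chunk = '';        // character data collected since the last element
  let xpathPosition = 0; // XPath: counts only non-empty text nodes

  const closeChunk = () => {
    mapping.set(cfiIndex, chunk.length > 0 ? ++xpathPosition : null);
    chunk = '';
  };

  for (const child of children) {
    if ('text' in child) {
      chunk += child.text;
    } else {
      closeChunk();  // the chunk before this element, possibly empty
      cfiIndex += 2; // skip the element's own even index
    }
  }
  closeChunk(); // trailing chunk after the last element
  return mapping;
}

const mapping = mapCfiChunksToXPathText(h2Children);
console.log([...mapping]);
```

In this model the CFI chunks 1, 3 and 9 are empty and have no XPath counterpart, while chunk 5 (the whitespace before the b tag) is the first text node, i.e. text()[1]. Any conversion that derives the text() position from the odd index by arithmetic alone will therefore be off whenever empty chunks precede the target.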

As a result, the XPath generated by EPUBJS.EpubCFI.prototype.generateXpathFromSteps can reference text nodes that do not exist in the document.

Maybe a fallback to the parent node is a possible solution?
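A rough sketch of what such a fallback could look like, not the library's actual behaviour. The function resolveWithParentFallback and its evaluate callback are hypothetical; in a browser, evaluate could wrap the doc.evaluate(...) call quoted earlier. The sketch only handles a trailing text() step, which is the failing case described here.

```javascript
// Hypothetical helper: `evaluate` is any function that takes an XPath string
// and returns the matching node, or null when nothing matches.
function resolveWithParentFallback(xpath, evaluate) {
  const node = evaluate(xpath);
  if (node !== null && node !== undefined) {
    return { node: node, usedFallback: false };
  }
  // Drop a trailing text() step, e.g. ".../text()[2]" becomes "...".
  const parentXPath = xpath.replace(/\/text\(\)\[\d+\]$/, '');
  if (parentXPath === xpath) {
    return { node: null, usedFallback: false }; // nothing to strip
  }
  const parent = evaluate(parentXPath);
  const found = parent !== null && parent !== undefined;
  return { node: found ? parent : null, usedFallback: found };
}

// Usage with a fake evaluator standing in for the document (assumption:
// only the h2 path resolves, mirroring the "Working" case above).
const h2Path = "./*/*[2]/*[position()=176 and @id='h2-5']";
const fakeNodes = new Map([[h2Path, { name: 'h2' }]]);
const fakeEvaluate = (xp) => (fakeNodes.has(xp) ? fakeNodes.get(xp) : null);

const result = resolveWithParentFallback(h2Path + '/text()[2]', fakeEvaluate);
console.log(result.node.name, result.usedFallback);
```

The trade-off is that the character offset from the CFI (the :0 part) is lost, so the result points at the element as a whole rather than at a position inside its text.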

Contributor

fchasen commented Oct 28, 2016

Sorry this slipped by me, thanks so much for writing this up.

I'll need to look into this a bit more, but I agree that a good approach for now is to fall back to the parent.

@fchasen fchasen added the Bug label Oct 28, 2016
@khcpietro

Is there any progress here? @MaxDaten pinpointed the problem exactly: CFI and XPath treat empty chunks differently.

@khcpietro

Here is my workaround.
You can insert a space between adjacent tags after the HTML content has loaded:

document.body.innerHTML = document.body.innerHTML.replaceAll('><', '> <')

Sadly, this changes the EPUB content itself and can cause performance issues. However, if you care more about consistency between CFI and XPath, you can try it.
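To see what the replacement does to the content, here is a small sketch that applies the same substitution to the h2 markup from the original report as a plain string. It uses a global regex instead of replaceAll only so that it also runs on older JavaScript engines; the variable names are illustrative.

```javascript
// The <h2> markup from the original report, as a plain string.
const sample = '<h2 id="h2-5" title="Kapitel 5"><a id="page_39"></a>' +
  '<span class="small">01100010110010</span> <b>5.</b> ' +
  '<span class="small">10010111010100</span></h2>';

// Same effect as replaceAll('><', '> <'): a space wherever two tags touch.
const padded = sample.replace(/></g, '> <');

console.log(padded);
console.log('characters added:', padded.length - sample.length);
```

Note that the space is also inserted inside the previously empty a element and before the closing h2 tag, so every formerly empty chunk becomes a real text node, at the cost of extra whitespace in the rendered content.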
