This repository has been archived by the owner on Dec 9, 2018. It is now read-only.

Add API for search functionality #231

Open
rvanlaak opened this issue Oct 29, 2013 · 4 comments

Comments

@rvanlaak

The output that pdf2htmlEX produces is great and is well suited to replacing Acrobat Reader. One feature that still makes Acrobat preferable to the browser output is the ability to search within the document.

Feature request: add a search API to the library, so that text searches can be performed in the document.

Features of the API could be:

  • search (iterate through results / search direction)
  • search & replace
  • case (in)sensitive search
  • regular expression search
  • mark a selection
  • search in PDF bookmarks
  • add bookmarks to search results
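To make the first few items concrete, here is a minimal sketch of what such an API could look like. It assumes the text of each page is already available as a plain string; `findMatches` and `createCursor` are hypothetical names for illustration only and are not part of pdf2htmlEX.

```javascript
// Hypothetical sketch: these functions do not exist in pdf2htmlEX.
// `pages` is assumed to be an array of plain-text strings, one per page.
function findMatches(pages, query, options = {}) {
  const { caseSensitive = false, regex = false } = options;
  // Escape regex metacharacters unless a regular expression search was requested.
  const source = regex ? query : query.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(source, caseSensitive ? 'g' : 'gi');
  const matches = [];
  pages.forEach((text, page) => {
    for (const m of text.matchAll(pattern)) {
      if (m[0].length === 0) continue; // skip empty matches
      matches.push({ page, start: m.index, end: m.index + m[0].length });
    }
  });
  return matches;
}

// Steps through the results in either direction, wrapping around at the ends.
function createCursor(matches) {
  let index = null;
  const step = (delta) => {
    if (matches.length === 0) return null;
    if (index === null) {
      index = delta > 0 ? 0 : matches.length - 1;
    } else {
      index = (index + delta + matches.length) % matches.length;
    }
    return matches[index];
  };
  return { next: () => step(1), previous: () => step(-1) };
}
```

Each result carries a page number and character offsets, so a later GUI could use them to scroll to and mark the match.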

Once this API works, a next step could be to implement a GUI that makes use of it. I will open a separate issue for that.

@coolwanglu
Owner

Replace is not possible, at least for now. I don't think it's even supported by PDF readers. I also doubt that adding bookmarks is feasible.

Searching in PDF bookmarks also sounds like a rare use case to me.

I'm not sure whether innerText or :contains is enough for these features; see http://stackoverflow.com/questions/12445020/javascript-window-find-doesnt-work-absolutely

But indeed there is a problem when lazy loading is enabled: pages are not loaded until viewed, so we need to load them before searching for any text.
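A rough sketch of the "load before searching" idea, under stated assumptions: `loadPageText` is a hypothetical callback that returns (or resolves to) the text of one page, and `createLazySearcher` is an illustrative name. Neither is an existing pdf2htmlEX function, and how pages are really fetched is not specified here.

```javascript
// Hypothetical sketch: `loadPageText(page)` stands in for whatever fetches
// the text of a lazily loaded page; it may return a string or a Promise.
function createLazySearcher(pageCount, loadPageText) {
  const cache = new Map(); // page number -> Promise resolving to the page text
  const getText = (page) => {
    if (!cache.has(page)) {
      cache.set(page, Promise.resolve(loadPageText(page)));
    }
    return cache.get(page);
  };
  // Returns the numbers of all pages that contain the query (case-insensitive).
  return async function search(query) {
    const needle = query.toLowerCase();
    if (needle === '') return [];
    // Force every page to load before searching; cached pages are reused.
    const texts = await Promise.all(
      Array.from({ length: pageCount }, (_, page) => getText(page))
    );
    const hits = [];
    texts.forEach((text, page) => {
      if (text.toLowerCase().includes(needle)) hits.push(page);
    });
    return hits;
  };
}
```

Caching the load promises means the first search pays the cost of loading every page, while later searches reuse what is already there.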

@iapain
Collaborator

iapain commented Dec 26, 2013

A possible solution would be either to search the text nodes in the DOM and highlight the matches, or to generate an inverted index for searching (using https://github.com/fagbokforlaget/pdfiijs, or pdftotext output fed into an indexing system).
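For the inverted-index option, a minimal sketch could look like the following. It assumes per-page text is already extracted (for example from pdftotext output split by page). `buildInvertedIndex` and `lookup` are hypothetical names and are not taken from pdfiijs or pdf2htmlEX.

```javascript
// Hypothetical sketch of an inverted index: word -> Set of page numbers.
// `pages` is assumed to be an array of plain-text strings, one per page.
function tokenize(text) {
  // Lowercase and split into runs of Unicode letters and digits.
  return text.toLowerCase().match(/[\p{L}\p{N}]+/gu) || [];
}

function buildInvertedIndex(pages) {
  const index = new Map();
  pages.forEach((text, page) => {
    for (const word of tokenize(text)) {
      if (!index.has(word)) index.set(word, new Set());
      index.get(word).add(page);
    }
  });
  return index;
}

// Returns the sorted page numbers that contain every word of the query.
function lookup(index, query) {
  const words = tokenize(query);
  if (words.length === 0) return [];
  let result = null;
  for (const word of words) {
    const postings = index.get(word) || new Set();
    result = result === null
      ? new Set(postings)
      : new Set([...result].filter((page) => postings.has(page)));
  }
  return [...result].sort((a, b) => a - b);
}
```

This only reports which pages contain the words. Highlighting the match on the page would still need the DOM text-node approach, and stemming (as snowball-js offers) is left out of the sketch.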

@rvanlaak
Author

@iapain the library you're proposing sounds great, especially since I have both a PDF file and its pdftotext output. Does snowball-js support the following use case?

My use case is that I have fragments from the pdftotext output that I would like to show/mark in the original PDF with its original markup. It would be awesome if I could use pdf2htmlEX to preserve the markup from the PDF.

@rvanlaak
Author

I've been digging through the changelog, release notes, and Blogspot posts, and found that it is possible to search the output and to compare the HTML, diff-style.

Could you elaborate a bit more on those features? I could not find any documentation about them.
