Extracting internal page links from PDF #156

Open
patbegg opened this issue Dec 20, 2014 · 5 comments

Comments

patbegg commented Dec 20, 2014

Hi,
At present, when converting a PDF that contains links created in the PDF, the links are extracted and displayed in the 'pagelinks' layer in the viewer. To create an internal page link within the viewer, you set the href to '#page-6', for example, to jump to page 6 of the document. However, if links are created in the PDF as invisible rectangles (which works for email and web links) and the web link is set to '#page-6', the links are not extracted. I have also tried adding the links as 'go to a page in the document' links, but these don't get extracted either.

Is it possible to create links in the PDF that link to internal pages, that will then be extracted during the conversion process?

EDIT: I can confirm that if you set the internal link value to 'http://#page-5' it will extract the links. But obviously the links have 'http://' prepended to them when what we want is just '#page-5' to jump to an internal page in the viewer.
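As a stopgap for the workaround described above, the prefix could be stripped again after conversion. The following is only a minimal sketch in Python: the function name `restore_internal_href` and the idea of post-processing the extracted hrefs are assumptions for illustration, not part of the converter or of viewer.js.

```python
# Hypothetical post-processing helper (not part of the converter or viewer.js).
# It undoes the 'http://' workaround so that 'http://#page-5' becomes '#page-5'.
WORKAROUND_PREFIX = "http://#"


def restore_internal_href(href: str) -> str:
    """Return '#page-N' style hrefs with the 'http://' workaround prefix removed."""
    if href.startswith(WORKAROUND_PREFIX):
        # Keep the leading '#' so the result is a plain fragment link.
        return "#" + href[len(WORKAROUND_PREFIX):]
    # Ordinary web links are left untouched.
    return href
```

Calling `restore_internal_href('http://#page-5')` would give back `'#page-5'`, while a regular link such as `'http://example.com/'` passes through unchanged.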

Thanks,
Pat

Contributor

lakenen commented Dec 30, 2014

I'm not familiar with the proper way to author PDF internal links, but the conversion should see them if they are authored properly. The #page-{n} that you see in the viewer is actually being created by viewer.js, and is not necessarily the original href of the internal link.

Here's an example of how to do this with Microsoft Word (click the download button to see the original .docx file). https://view-api.box.com/1/sessions/a6732dda0e8244b5bfb2965418a7cdd0/view

Author

patbegg commented Dec 31, 2014

I think this must be down to how your conversion engine works. In the original Word doc you sent, the link has an href of '#Page1'. If I add a link in a PDF with the href '#Page1', it doesn't get extracted during the conversion from PDF to SVG/HTML/CSS. However, if I add the same link in a Word (.docx) file, it DOES extract the link. If I prepend 'http://' to the link in the PDF, it WILL extract the link, but then the link doesn't function correctly in the converted document.

Are you able to change the conversion engine so it extracts internal links, i.e. links with an href of '#Page3', for example, instead of discarding them as invalid links?

Contributor

lakenen commented Dec 31, 2014

In the Word doc I sent, Page1 was a bookmark I explicitly created in the document. I am not sure how to create those bookmarks in PDFs. I'll loop in the conversion team and get back to you.

Author

patbegg commented Feb 15, 2015

Is there any news on this? It's very common for our clients to add links to the PDF directly, or to add them in InDesign and then convert to PDF. At present, if I add a link to a PDF with the URL '#page=4' before conversion, I get this in the info.json after conversion: file://localhost/tmp/viewapi/workspace/convert-cb36ae56894f476aa43dfd437cd64b1b/#page-3

Can this be made uniform in some way rather than us using a regex to find the links?
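Until the output is uniform, one way to avoid a hand-written regex might be to let a URL parser isolate the fragment. The sketch below is in Python and assumes the link value has already been read out of info.json as a string. The function name `internal_page_number` is hypothetical, and the `page-N` fragment shape is taken only from the example above.

```python
from typing import Optional
from urllib.parse import urlsplit

# Assumed fragment shape, based on the '#page-3' example from info.json.
PAGE_PREFIX = "page-"


def internal_page_number(url: str) -> Optional[int]:
    """Return N for URLs ending in '#page-N', or None for any other link."""
    # urlsplit separates the fragment from the scheme, host, and path,
    # so the temporary file:// workspace path is ignored entirely.
    fragment = urlsplit(url).fragment
    digits = fragment[len(PAGE_PREFIX):]
    if fragment.startswith(PAGE_PREFIX) and digits.isdecimal():
        return int(digits)
    return None
```

With the URL quoted above, this would return the page number from the fragment, and the result could be turned back into a viewer link with `f"#page-{n}"`. Links without a `page-N` fragment come back as `None`, so they can be handled as ordinary external links.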

Contributor

lakenen commented Feb 18, 2015

@patbegg do you have an example document that exhibits this behavior? You can send it to api@box.com, or link a downloadable View API session URL here if you don't mind it being publicly accessible.
