Extracting internal page links from PDF #156

patbegg · 2014-12-20T13:31:57Z

Hi,
At present, when converting a PDF, links that were created in the PDF are extracted and displayed in the 'pagelinks' layer in the viewer. If you want to create an internal page link within the viewer, you set the href to '#page-6', for example, to jump to page 6 of the document. However, if links are created in the PDF as invisible rectangles (which works for email and web links) and the web link is set to '#page-6', the links are not extracted. I have also tried adding the links as 'go to a page in the document' links, but these don't get extracted either.

Is it possible to create links in the PDF that link to internal pages, that will then be extracted during the conversion process?

EDIT: I can confirm that if you set the internal link value to 'http://#page-5' it will extract the links. But obviously the links have 'http://' prepended to them when what we want is just '#page-5' to jump to an internal page in the viewer.
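For anyone relying on the 'http://' workaround in the meantime, the prefix could be stripped again after conversion. This is only a minimal sketch: `strip_workaround_prefix` is a hypothetical helper, and the pattern assumes the extracted href comes back exactly as 'http://#page-N', which is what the EDIT above describes but is not a documented converter format.

```python
import re

# Assumption: the workaround href is extracted verbatim as 'http://#page-N'
# (or 'https://#page-N'). Anything else is treated as a normal external link.
_WORKAROUND_HREF = re.compile(r"^https?://(#page-\d+)$")


def strip_workaround_prefix(href: str) -> str:
    """Return '#page-N' for 'http://#page-N'; leave any other href unchanged."""
    match = _WORKAROUND_HREF.match(href)
    return match.group(1) if match else href
```

Real web links such as 'http://example.com/#page-5' have a host between the scheme and the fragment, so they do not match and pass through untouched.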

Thanks,
Pat

lakenen · 2014-12-30T20:59:43Z

I'm not familiar with the proper way to author PDF internal links, but the conversion should see them if they are authored properly. The #page-{n} that you see in the viewer is actually being created by viewer.js, and is not necessarily the original href of the internal link.

Here's an example of how to do this with Microsoft Word (click the download button to see the original .docx file). https://view-api.box.com/1/sessions/a6732dda0e8244b5bfb2965418a7cdd0/view

patbegg · 2014-12-31T00:53:05Z

I think this must be to do with how your converter engine works. In the original Word doc you sent, the link has an href of '#Page1'. If I add a link in a PDF with the href '#Page1', it doesn't get extracted during the conversion from PDF to SVG/HTML/CSS. However, if I add the same link in a Word (.docx) file, it DOES extract the link. If I prepend 'http://' to the link in the PDF, it WILL extract the link, but then the links don't function correctly in the converted document.

Are you able to change the conversion engine so it extracts internal links, i.e. links with an href of '#Page3', for example, instead of discarding them as invalid links?

lakenen · 2014-12-31T17:40:23Z

In the Word doc I sent, Page1 was a bookmark I explicitly created in the document. I am not sure how to create those bookmarks in PDFs. I'll loop in the conversion team and get back to you.

patbegg · 2015-02-15T10:16:22Z

Is there any news on this? It's very common for our clients to add links to the PDF, or to add them in InDesign and then convert to PDF. At present, if I add a link with the URL '#page=4' to a PDF before conversion, I get this in the info.json after conversion: file://localhost/tmp/viewapi/workspace/convert-cb36ae56894f476aa43dfd437cd64b1b/#page-3

Can this be made uniform in some way rather than us using a regex to find the links?
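For reference, the kind of client-side matching this would replace looks roughly like the sketch below. `internal_page_number` is a hypothetical helper, and it assumes that internal links always come out of the conversion as a file:// URL ending in a '#page-N' fragment, as in the single info.json value quoted above. Whether N is zero-based or one-based is not clear from that one sample ('#page=4' went in, '#page-3' came out), so the number is returned exactly as found.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Assumption: converter-generated internal links carry a 'page-N' fragment.
_PAGE_FRAGMENT = re.compile(r"^page-(\d+)$")


def internal_page_number(href: str) -> Optional[int]:
    """Return N for a 'file://.../#page-N' href, or None for any other link."""
    parts = urlsplit(href)
    # Only file:// URLs are treated as converter-generated internal links;
    # http(s) and mailto links are left for normal handling.
    if parts.scheme != "file":
        return None
    match = _PAGE_FRAGMENT.match(parts.fragment)
    return int(match.group(1)) if match else None
```

Splitting the URL first and matching only the fragment avoids depending on the temporary workspace path, which presumably changes with every conversion.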

lakenen · 2015-02-18T16:11:21Z

@patbegg do you have an example document that exhibits this behavior? You can send it along to api@box.com or link a downloadable view api session URL here if you don't mind it being publicly accessible.
