This repository has been archived by the owner on Dec 9, 2018. It is now read-only.

Google Custom Searchability #520

Open
GarrettHartley opened this issue May 8, 2015 · 10 comments

@GarrettHartley

Is there a way to optimize the conversion so that the resulting HTML is searchable by Google?

The resulting HTML is not indexable by Google. I realized this while trying to create a custom search engine for the sites I have converted using this tool.

This seems like a fatal flaw, because one of the main benefits of HTML over PDF is SEO.

This is the google custom search tool I am talking about:
https://cse.google.com/cse/

Converted with pdf2htmlEX and NOT searchable with Google Custom Search:

http://education.byu.edu/sites/default/shared/code/SEEL/Library/File_Structure/pre_k/letter-knowledge/a/aachoo-andy/html/pre_k--letter-knowledge--a--aachoo-andy--lesson-plan.html

Converted using Dreamweaver and works with Google Custom Search:

http://education.byu.edu/seel/LessonPlans/Pre-K/Alliteration/A/a_aachoo_andy_alliteration_activity_plans_and_resources.html?iframe=true&width=100%&height=100%;

@duanyao
Collaborator

duanyao commented May 11, 2015

I know very little about SEO. Do you think it would help to simply change the <div> elements used by pdf2htmlEX to more semantic <hN> and <p> tags?

@GarrettHartley
Author

I don't think that would change anything. I'm not familiar with SEO either.

I've been told that the main reasons HTML is preferred over PDF are that HTML supports dynamic content, such as links, and that it is more easily searched and recognized by web crawlers.

Is there a setting for this converter that preserves links? I noticed that the conversion didn't preserve the functionality of my links.

On the bright side, the output of this PDF-to-HTML converter looks exactly the same as the PDF!

But it also seems to function the same as a PDF. If so, what's the point?

@duanyao
Collaborator

duanyao commented May 11, 2015

pdf2htmlEX should be able to convert links in a PDF to HTML links. If it doesn't, please file an issue.

PDFs are not supported natively by all browsers (and maybe never will be), so if you want your PDFs to be reliably accessible on the web, converting them to HTML is a good idea.

If you want to add more dynamic content, you can always edit the converted HTML/JS.

@GarrettHartley
Author

Ok, yeah. That makes sense.

Will you be looking into this SEO issue?

I will let you know if I find anything.

@duanyao
Collaborator

duanyao commented May 11, 2015

I'm afraid I don't have the environment needed for trial and error with SEO. If you can figure out why the output of pdf2htmlEX is not searchable by Google, maybe I can improve it.

@coolwanglu
Owner

If --split-pages is not enabled, text should be static in the HTML.

@KrishnaPG

For SEO, the basic need is keyword identification, which is difficult when words are split into individual fragments. For example, consider this generated HTML:

<div class="t m0 x5 hb y14 ff1 fsa fc7 sc0 ls0 ws0">Techno<span class="_ _9"></span>logy Stack </div>

The word "Technology" is split in the middle by a span tag, which makes it impossible for search engines to classify the document as being about technology. The main problem is that the inserted span accounts for only 1.09 px, which is not really worth the effort in HTML.

For example, here is the HTML rendered in a browser (after removing the span tag): [screenshot of the rendered text omitted]

In PDF that 1.09 px could make a large difference across devices, but HTML is essentially responsive (producing different output on different devices), so perhaps intermediate span elements below a certain threshold should be ignored and not emitted (especially when they break words).

One possible approach is:

*  eliminating or minimizing tag insertion in the middle of text (where there is no whitespace)
*  not generating `span` margins below a certain (configurable?) threshold (e.g. 5 px)
*  while retaining the current pixel-level accuracy for non-textual content (images, control sequences, etc.)
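As a rough illustration of the first two points, here is a minimal post-processing sketch (not part of pdf2htmlEX itself). It assumes the spacing spans follow the `_ _N` class pattern shown above, and it omits the width-threshold check for brevity, removing every empty spacing span:

```python
import re

# Hypothetical post-processing pass: drop the empty <span> spacing elements
# that pdf2htmlEX inserts inside words (e.g. <span class="_ _9"></span>).
# The class pattern is an assumption based on the example above; a real
# pass would also check the span's computed width against a threshold.
SMALL_SPAN = re.compile(r'<span class="_ _[0-9a-f]+"></span>')

def merge_split_words(html: str) -> str:
    """Remove empty spacing spans so split words like
    'Techno<span ...></span>logy' become contiguous, indexable text."""
    return SMALL_SPAN.sub("", html)

print(merge_split_words('Techno<span class="_ _9"></span>logy Stack'))
# -> Technology Stack
```

Since the removed spans have no text content, the visible text is unchanged; only the sub-pixel spacing they encoded is lost.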

The second requirement for SEO is using contextual tags such as h1, h2, h3, and so on.

Presently the generated output uses div elements with classes specifying varying font-size heights (such as <div class="... h1 t m0 x1...">).

Instead, using <h1> tags with the same classes, such as <h1 class='...h1 t m0 x1 ...'>, in place of the div tags is one good option to consider here (after sorting the font sizes and assigning the heading levels in decreasing order).
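A sketch of that idea as a post-processing step (again hypothetical: it assumes the `h1`/`h2`/`h3` classes have already been assigned in decreasing font-size order, which a real pass would derive from the stylesheet's px values):

```python
import re

# Illustrative sketch: rewrite <div> elements whose class list contains a
# font-size class h1..h3 into real <h1>..<h3> heading tags, keeping all
# classes intact so the pdf2htmlEX stylesheet still applies unchanged.
HEADING_DIV = re.compile(r'<div class="([^"]*\bh([123])\b[^"]*)">(.*?)</div>')

def promote_headings(html: str) -> str:
    """Turn styled divs into semantic headings without changing rendering."""
    return HEADING_DIV.sub(r'<h\2 class="\1">\3</h\2>', html)

print(promote_headings('<div class="h1 t m0 x1">Technology Stack</div>'))
# -> <h1 class="h1 t m0 x1">Technology Stack</h1>
```

Divs without a matching font-size class (body text, layout wrappers) are left untouched.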

The next important SEO features are title and alt attributes for links and images, but I am not sure that would be easy without some external help.

There are other SEO requirements, such as responsiveness and page-loading speed, which I think can be tackled by the users.

One good way would be to let users choose between pixel-perfect (less SEO-friendly) and text-perfect (good SEO, but perhaps not pixel-identical to the PDF) when generating the output.

Like the way the zip and H.264 codecs work, profiles on a scale of 0 to 9, mapped from pixel-perfect to text-perfect, would be one good way to go.

pdf2htmlEX might already implement most of these in one form or another; it's just a matter of fine-tuning and figuring out what works for SEO.

@duanyao
Collaborator

duanyao commented Dec 11, 2015

@KrishnaPG Thanks for the detailed suggestion!

However, I would be surprised if search engines couldn't handle the noise of the span tags -- simply removing those tags while keeping the text nodes should produce correct text. Do you have any references on this?

Using <hN> and <p> instead of <div>, and adding <title>, are doable; however, I'm not sure how to test the effect. If anyone can test, I suggest manually editing (or scripting, if you can) the output of pdf2htmlEX and seeing what happens.

@fmalina

fmalina commented Dec 20, 2015

You might want to look at https://github.com/fmalina/transcript, a post-processing tool for pdf2htmlEX output that produces semantic HTML based on visual design conventions.

@duanyao
Collaborator

duanyao commented Dec 22, 2015

@fmalina Interesting, thanks!
@GarrettHartley can you try https://github.com/fmalina/transcript and see whether it makes a difference?
