Google Custom Searchability #520
Comments
I know very little about SEO. Do you think it will be helpful to just change |
I don't think that would change anything. I'm not familiar with SEO either. I've been told that the main reasons HTML is preferred to PDF are that HTML supports dynamic content, such as links, and that it is more easily searchable and recognizable by web crawlers. Is there a setting for this converter that preserves links? I noticed that the converter didn't preserve the functionality of my links. On the bright side, the converted HTML looks exactly the same as the PDF! But it also seems to function the same as a PDF. If so, what's the point? |
pdf2htmlEX should be able to convert links in a PDF to HTML links. If it doesn't, you can file an issue. PDFs are not natively supported by all browsers (and may never be), so if you want your PDFs to be reliably accessible on the web, converting them to HTML is a good idea. If you want to add more dynamic content, you can always edit the converted HTML/JS. |
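One quick way to check whether links survived the conversion is to list the `href` values found in the output. The following is only a minimal sketch using Python's standard library; the `sample` markup, its class names, and the helper names `LinkCollector` and `collect_links` are made up for illustration and are not taken from real pdf2htmlEX output.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def collect_links(markup):
    """Return all non-empty href values in the given HTML string."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


# Hypothetical snippet standing in for a converted page (an assumption).
sample = '<div><a class="l" href="https://cse.google.com/cse/"><div class="d"></div></a></div>'
print(collect_links(sample))
```

If this returns an empty list for a page whose source PDF contained links, that would be a concrete detail worth including in an issue report.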
Ok, yeah. That makes sense. Will you be looking into this SEO issue? I will let you know if I find anything. |
I'm afraid I don't have the necessary environment to do trial and error on SEO. If you can figure out why the output of pdf2htmlEX is not searchable by Google, maybe I can improve it. |
If |
@KrishnaPG Thanks for the detailed suggestion! However, I would be surprised if search engines couldn't handle the noise of span tags -- just removing these tags while keeping the text nodes should produce correct text. Do you have any references on this? Using |
You might want to look at https://github.com/fmalina/transcript, a post-processing tool for pdf2htmlEX output that provides semantic HTML based on visual design conventions. |
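To illustrate the point about span tags, here is a minimal sketch that drops all tags and keeps only the text nodes, using Python's standard library. The `sample` markup and its class names are invented for the example and are not real pdf2htmlEX output; `TextExtractor` and `visible_text` are hypothetical helper names. It says nothing about how any search engine actually processes pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Ignores every tag and keeps the text nodes in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(markup):
    """Return the concatenated text nodes of an HTML string."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)


# Hypothetical line of converted text with nested spans (an assumption).
sample = '<div class="t">Aachoo <span class="ls1">Andy</span> <span class="ws2">lesson</span> plan</div>'
print(visible_text(sample))  # Aachoo Andy lesson plan
```

Note that this only works when the spaces are present as real characters in the markup. If word spacing were produced purely by CSS offsets, the extracted text would run words together, which might be one thing to check in the actual output.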
@fmalina Interesting, thanks! |
Is there a way to optimize the conversion so that the resulting HTML is searchable by Google?
The resulting HTML is not searchable by Google. I realized this when trying to create a custom search engine for the pages I have converted using this tool.
This seems like a fatal flaw, because one of the main benefits of HTML over PDF is SEO.
This is the Google Custom Search tool I am talking about:
https://cse.google.com/cse/
Converted with pdf2htmlEX and NOT searchable with Google Custom Search:
http://education.byu.edu/sites/default/shared/code/SEEL/Library/File_Structure/pre_k/letter-knowledge/a/aachoo-andy/html/pre_k--letter-knowledge--a--aachoo-andy--lesson-plan.html
Converted using Dreamweaver and searchable with Google Custom Search:
http://education.byu.edu/seel/LessonPlans/Pre-K/Alliteration/A/a_aachoo_andy_alliteration_activity_plans_and_resources.html?iframe=true&width=100%&height=100%;