New feature --fit-every #57

Closed
wants to merge 2 commits into from

4 participants

Simanas Lu Wang Raza Mobin John Hewson
Simanas

C++ is not my favorite programming language, but I have managed to implement a --fit-every option. If it is set to 1 and --fit-width/height is given, every page will be fitted to the specified width/height. If --zoom is given, this option has no effect.

Lu Wang coolwanglu commented on the diff
src/HTMLRenderer/general.cc
((25 lines not shown))
+ if(zoom_factors.empty())
+ {
+ zoom = 1.0;
+ }
+ else
+ {
+ zoom = *min_element(zoom_factors.begin(), zoom_factors.end());
+ }
+
+ text_scale_factor1 = max<double>(zoom, param->font_size_multiplier);
+ text_scale_factor2 = zoom / text_scale_factor1;
+}
+
+double HTMLRenderer::text_zoom_factor (int page_number){
+
+ if(is_positive(param->fit_every) && !is_positive(param->zoom)){
Lu Wang Owner

why check param->zoom here

Simanas
Simanas added a note

As we discussed earlier, if zoom is given, fit_every should not affect anything.

Lu Wang Owner

why is that, it would be confusing.

Simanas
Simanas added a note

If we do not check it here, then when --zoom, --fit-width and --fit-every are all given it would produce unwanted results: in a document with different page sizes, *min_element(zoom_factors.begin(), zoom_factors.end()) will return the --zoom value on some pages and the zoom calculated from --fit-width on others.

Lu Wang Owner

Currently, if more than one of --zoom, --fit-width/height is specified, the "smallest one" will be used.
"--zoom disables --fit-every" is not intuitive to me, and not even written in the manpage.

Simanas
Simanas added a note

You are wrong. Currently the smallest of --zoom, --fit-width/document_width and --fit-height/document_height is used to get text_scale_factors for every page.

In my modification, if you remove the param->zoom check, then for every page the smallest of --zoom, --fit-width/page_width and --fit-height/page_height will be used, which would be a stupid behavior. :)
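To make the point above concrete, here is a minimal sketch of the "smallest candidate wins" rule from the diff. pick_zoom is a hypothetical helper written for this illustration, not a function in pdf2htmlEX; a value of 0 or less stands for "option not given", mirroring the is_positive checks in the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not part of pdf2htmlEX): return the smallest zoom among
// the options that were actually given. A value <= 0 means "not given".
double pick_zoom(double zoom, double fit_width, double fit_height,
                 double width, double height)
{
    assert(width > 0 && height > 0);
    std::vector<double> candidates;
    if (zoom > 0)       candidates.push_back(zoom);
    if (fit_width > 0)  candidates.push_back(fit_width / width);
    if (fit_height > 0) candidates.push_back(fit_height / height);
    if (candidates.empty()) return 1.0;  // same default as the patch
    return *std::min_element(candidates.begin(), candidates.end());
}
```

With an assumed --zoom 1.5 and --fit-width 960, a page 600 units wide gets min(1.5, 960/600) = 1.5, while a page 800 units wide gets min(1.5, 960/800) = 1.2. Applied per page, the first page is sized by --zoom and the second by --fit-width, which is the mixed result the param->zoom check is meant to avoid.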

Simanas
Simanas added a note

Yeah... the man page should be modified if you decide to accept this change.

Lu Wang
Owner

Thanks for your effort. I'm reviewing the commit.
Still I think it should be part of a PDF manipulating tool, how about the feature to reorder the pages?

Can you provide the PDF you mentioned, with different sizes for each page? (but you want them equal)

Simanas

I made this improvement because I really needed it, although c++ is not my favorite language. If I needed this, there is a high probability that it will be needed for someone else in the future.
Since you share your code, I share my improvement too. Anyway, you are the master here and you decide whether to include this in your master branch or not. :)
There is no need to provide any PDF file to prove that PDFs with different page sizes exist. They just exist.

Lu Wang
Owner

Yeah, I believe they exist, but I've never seen one. I just want to know why the sizes are different when they should have been the same.

Simanas

In most cases that I have seen, some of the pages are flipped 90 degrees. For example, a report where most of the pages are in portrait mode, but some are in landscape because they contain charts, tables etc...

Lu Wang
Owner

So if a few pages are in the wrong direction, why would you fix it by zooming instead of rotating?

Simanas

Because in real life it is meant to be read in landscape mode, while the others are in portrait, so the author of the PDF file flipped these pages 90 degrees so that you don't have to rotate your computer screen while trying to read such a page of a report in Adobe Acrobat Reader.

But this is not the only case... Every page in a PDF document can have its own width/height, which is perfectly acceptable in a PDF document (you can easily zoom in and out with any PDF viewer), but it may be absolutely unacceptable when you want to produce one consistent 960px wide HTML document.
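A rough sketch of the difference being described, assuming a report with one portrait page (612 x 792 points) and one landscape page (792 x 612 points) and a target width of 960. The page sizes, the PageSize type and both function names are assumptions made for this illustration; none of them come from pdf2htmlEX.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical page description; sizes are in PDF points.
struct PageSize { double width; double height; };

// Document-wide fit: one scale, derived from the widest page, applied to all pages.
std::vector<double> widths_with_single_scale(const std::vector<PageSize>& pages,
                                             double fit_width)
{
    assert(!pages.empty() && fit_width > 0);
    double max_width = 0.0;
    for (const PageSize& p : pages) max_width = std::max(max_width, p.width);
    assert(max_width > 0);
    const double scale = fit_width / max_width;
    std::vector<double> out;
    for (const PageSize& p : pages) out.push_back(p.width * scale);
    return out;
}

// Per-page fit: each page gets its own scale, so every output width is fit_width.
std::vector<double> widths_with_per_page_scale(const std::vector<PageSize>& pages,
                                               double fit_width)
{
    std::vector<double> out;
    for (const PageSize& p : pages) {
        assert(p.width > 0);
        out.push_back(p.width * (fit_width / p.width));
    }
    return out;
}
```

For the two assumed pages, the single scale is 960/792, so the portrait page comes out about 742px wide and the landscape page 960px wide. With a per-page scale both come out 960px wide, at the cost of the two pages no longer sharing one zoom.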

Lu Wang
Owner
Raza Mobin

This is an example of a PDF with different page sizes:

https://dl.dropbox.com/u/31309918/dd/0qDiSazT62.pdf

Lu Wang
Owner

@razamobin

Yes, and in most PDF viewers, all pages are zoomed in/out together. There's no option to make all of them the same width.

Simanas

Man, you are seriously blocking yourself! razamobin gave you a perfect example of such a document! There are plenty of them all around the world, and if you want to convert an unlimited number of documents to HTML where the HTML document always has to be exactly 960px wide, you are stuck if you do not have this option in your converter!

If you want your product to be the best, you have to click the "accept" button in your head, and here, more often!

Sorry for my anger, but you drive me crazy with this. I have done all the work, it is only one option now which is perfectly functional and all you have to do is to click accept.

Lu Wang
Owner

@Simanas Please relax, I appreciate your efforts, but I just don't think "converting to an HTML document exactly 960px in width, from a PDF file that contains pages with different widths" is a common feature that should be included.

I need more evidence about this, and that's why I've left it open and marked it as 'need vote' instead of just closing it.

Please keep in mind that I'm the author & maintainer; once this feature is included, it has to be "maintained". Suppose another person really wants all odd pages rotated CW and even pages rotated CCW, do you think it should be included in pdf2htmlEX? I've actually created special features for a few people, for different purposes, but I never merged them into master.

Also note that no second person has supported this request so far, despite the 'need vote' label.

As I said earlier, I'm waiting for more evidence: testimonies, similar features in other PDF CONVERSION tools etc.

Simanas

whatever...

Lu Wang
Owner

@Simanas, when I'm stuck on a missing option in some software, I modify it myself.
When I submit a patch and it is rejected by the author, I know that it's my personal need, and I'm still happy with my fork.

John Hewson

It's not the usual behaviour of PDF readers to re-scale on a page-by-page basis. I just checked Adobe Acrobat, Chrome, and OSX Preview, and none of them do that. The standard method is "Zoom to Fit" where the entire document is scaled by a single factor. This is the same behaviour as --fit-width <arg> provides.

@Simanas, --fit-width and --zoom are mutually exclusive. If you're trying to fit to a given width such as 960px then use --fit-width. If you're trying to implement "Zoom to Fit" then you'll need to do that in CSS in the browser. However, if you're trying to zoom each page with its own specific scale, then this is not a standard feature of PDF viewers, as it is no longer a faithful rendering of the document. If you want to do that, then I'd recommend pre-processing the PDF via another tool.

My vote is that pdf2htmlEX should stick to doing one thing and do it well, and avoid feature creep.

Simanas

So please tell me: is this piece of software aimed to serve as a PDF VIEWER or is it aimed to be a CONVERTER? Because all your arguments are of the form: "Usual PDF viewers do not have such functionality, therefore..."

If it is only a VIEWER, then please forgive me, I was too emotional. The fewer options you have, the easier your viewer is to use and understand, since you just plug in the PDF and pull out the HTML. I agree, then it has to be as simple as 2 * 2. After all, it is probably the most complicated way to represent PDF documents on the web.

But if you treat it as a CONVERTER, then you would probably like to have a bunch of options to produce the converted result that suits your needs best. The more options you can choose from before you start processing an unlimited number of files, the more universal and flexible the CONVERTER gets.

Lu Wang
Owner
John Hewson

@Simanas a conversion should faithfully reproduce the input, just in a different format. You're wanting to manipulate the output by re-sizing individual pages, which actually means you're breaking the concept of page size.

So while re-scaling individual pages may fix your problem, it is fundamentally wrong. You can easily manipulate a PDF file in this manner using other tools, before converting the file with pdf2htmlex. There is no need to change pdf2htmlex to achieve this.

If you can achieve your goal by manipulating a PDF file beforehand, then you can see it is not a good candidate feature for pdf2htmlex. There has to be a separation of concerns, there's no point re-implementing functionality which you can get from existing tools

Simanas

Ok now we are clear. Now I understand your vision.

I think that if you are converting something from one format to another, you should think not only about how to convert it so that it looks and acts exactly as it did in the original format, but also about the different concepts behind the formats.

As I mentioned earlier, PDF is dedicated to representing content in a FIXED style and layout, while HTML works in a very different manner. It is intended to represent CONTENT in the format that is best for a web browser.

You will probably never want to manipulate the original PDF layout sizes to fit them to e.g. 960dpi, since it is intended to be as it was originally designed, but if you want to extract the content from those layouts and display it on a screen, you will need this solution.

It makes perfect sense to me to have such an option in this converter, since you are converting from one concept of data visualization to another. I do not treat it as a manipulation, because I see this piece of software in a much broader than just technical context.

John Hewson
Simanas

I know I can use different tools to achieve this, but it makes no sense to me. I see that we are in a never-ending discussion, and I think that there is no one "best" way to treat this.

The fact is that PDF is a fixed layout format, and breaking that breaks what a PDF fundamentally is

I think we are already breaking things just by converting content from PDF to HTML.
We are pulling out HTML, not PDF, so why should we bother ourselves with how not to break fundamental PDF rules? It makes no sense to me to follow PDF rules, since it is no longer a PDF document.
Why not add as much flexibility as possible to create true HTML, not only a replica of a PDF in HTML format?

John Hewson
Lu Wang
Owner

"why we should bother our selves on how to not break fundamental pdf rules? It makes me no sense to follow pdf rules, since it is no longer a pdf document. "

Why? No why, this is by design. You don't want to see anything changed if you convert a PNG to a JPEG, do you?

Earlier I've tried to make text reflowable, which might be something like what you meant about "representation of the content", or "True HTML". But this link shows the difficulties I encountered. I will do that once I find a solution some day. I still want to optimize for HTML without breaking visual accuracy.

I'm the one who is responsible for the quality of pdf2htmlEX, so I'm now playing the AUTHOR card. As written in README, accurate rendering is the #1 concern of pdf2htmlEX. And according to your definitions, pdf2htmlEX is absolutely a PDF viewer.

@Simanas, I still appreciate your efforts, including the patch and all the arguments here. I believe that it's useful for you and some other people, but right now I don't think it's the right feature for pdf2htmlEX. Anyone can possibly change/consolidate my mind by showing:

  • Agreement/disagreement with this feature
  • Existence/absence of similar functions in other PDF conversion tools

I will still leave it open for a few months, but might not always respond to all messages.

Simanas

Ok. Now I fully understand your vision, and you are right: such an option does not fit in it at all.

Anyway, I really appreciate your efforts in developing this, and please excuse me if I was too emotional. :)

Lu Wang
Owner

Closed due to lack of supporting votes.
Thank you very much for your efforts all the same.

Lu Wang coolwanglu closed this
Commits on Dec 12, 2012
  1. Simanas

    Added option --fit-every

    Simanas authored
  2. Simanas
3  src/HTMLRenderer/HTMLRenderer.h
@@ -303,7 +303,8 @@ class HTMLRenderer : public OutputDev
* factor1 & factor 2 are determined according to zoom and font-size-multiplier
*
*/
- double text_zoom_factor (void) const { return text_scale_factor1 * text_scale_factor2; }
+ void determine_scale_factors(int width, int height);
+ double text_zoom_factor (int page_number);
double text_scale_factor1;
double text_scale_factor2;
78 src/HTMLRenderer/general.cc
@@ -106,7 +106,7 @@ void HTMLRenderer::process(PDFDoc *doc)
}
doc->displayPage(this, i,
- text_zoom_factor() * DEFAULT_DPI, text_zoom_factor() * DEFAULT_DPI,
+ text_zoom_factor(i) * DEFAULT_DPI, text_zoom_factor(i) * DEFAULT_DPI,
0,
(param->use_cropbox == 0),
false, false,
@@ -235,38 +235,7 @@ void HTMLRenderer::pre_process(PDFDoc * doc)
/*
* determine scale factors
*/
- {
- double zoom = 1.0;
-
- vector<double> zoom_factors;
-
- if(is_positive(param->zoom))
- {
- zoom_factors.push_back(param->zoom);
- }
-
- if(is_positive(param->fit_width))
- {
- zoom_factors.push_back((param->fit_width) / preprocessor.get_max_width());
- }
-
- if(is_positive(param->fit_height))
- {
- zoom_factors.push_back((param->fit_height) / preprocessor.get_max_height());
- }
-
- if(zoom_factors.empty())
- {
- zoom = 1.0;
- }
- else
- {
- zoom = *min_element(zoom_factors.begin(), zoom_factors.end());
- }
-
- text_scale_factor1 = max<double>(zoom, param->font_size_multiplier);
- text_scale_factor2 = zoom / text_scale_factor1;
- }
+ determine_scale_factors(preprocessor.get_max_width(),preprocessor.get_max_height());
// we may output utf8 characters, so always use binary
{
@@ -393,6 +362,49 @@ void HTMLRenderer::post_process()
}
}
+void HTMLRenderer::determine_scale_factors(int width, int height)
+{
+ double zoom = 1.0;
+
+ vector<double> zoom_factors;
+
+ if(is_positive(param->zoom))
+ {
+ zoom_factors.push_back(param->zoom);
+ }
+
+ if(is_positive(param->fit_width))
+ {
+ zoom_factors.push_back((param->fit_width) / width);
+ }
+
+ if(is_positive(param->fit_height))
+ {
+ zoom_factors.push_back((param->fit_height) / height);
+ }
+
+ if(zoom_factors.empty())
+ {
+ zoom = 1.0;
+ }
+ else
+ {
+ zoom = *min_element(zoom_factors.begin(), zoom_factors.end());
+ }
+
+ text_scale_factor1 = max<double>(zoom, param->font_size_multiplier);
+ text_scale_factor2 = zoom / text_scale_factor1;
+}
+
+double HTMLRenderer::text_zoom_factor (int page_number){
+
+ if(is_positive(param->fit_every) && !is_positive(param->zoom)){
+ determine_scale_factors(preprocessor.get_page_width(page_number),preprocessor.get_page_height(page_number));
+ }
+ return text_scale_factor1 * text_scale_factor2;
+}
+
+
void HTMLRenderer::set_stream_flags(std::ostream & out)
{
// we output all ID's in hex
1  src/Param.h
@@ -27,6 +27,7 @@ struct Param
double zoom;
double fit_width, fit_height;
+ int fit_every;
double h_dpi, v_dpi;
int use_cropbox;
1  src/pdf2htmlEX.cc
@@ -65,6 +65,7 @@ void parse_options (int argc, char **argv)
.add("zoom", &param.zoom, 0, "zoom ratio", nullptr, true)
.add("fit-width", &param.fit_width, 0, "fit width", nullptr, true)
.add("fit-height", &param.fit_height, 0, "fit height", nullptr, true)
+ .add("fit-every", &param.fit_every, 0, "fit every page to fit-width/height", nullptr, true)
.add("hdpi", &param.h_dpi, 144.0, "horizontal DPI for non-text")
.add("vdpi", &param.v_dpi, 144.0, "vertical DPI for non-text")
.add("use-cropbox", &param.use_cropbox, 0, "use CropBox instead of MediaBox")
5 src/util/Preprocessor.cc
@@ -24,6 +24,7 @@ using std::cerr;
using std::endl;
using std::flush;
using std::max;
+using std::vector;
Preprocessor::Preprocessor(const Param * param)
: OutputDev()
@@ -87,7 +88,9 @@ void Preprocessor::drawChar(GfxState *state, double x, double y,
void Preprocessor::startPage(int pageNum, GfxState *state)
{
max_width = max<double>(max_width, state->getPageWidth());
- max_height = max<double>(max_height, state->getPageHeight());
+ max_height = max<double>(max_height, state->getPageHeight());
+ page_widths[pageNum] = state->getPageWidth();
+ page_heights[pageNum] = state->getPageHeight();
}
const char * Preprocessor::get_code_map (long long font_id) const
6 src/util/Preprocessor.h
@@ -20,9 +20,12 @@
#include <PDFDoc.h>
#include <Annot.h>
#include "Param.h"
+#include <map>
namespace pdf2htmlEX {
+using std::map;
+
class Preprocessor : public OutputDev {
public:
Preprocessor(const Param * param);
@@ -45,11 +48,14 @@ class Preprocessor : public OutputDev {
const char * get_code_map (long long font_id) const;
double get_max_width (void) const { return max_width; }
double get_max_height (void) const { return max_height; }
+ double get_page_width (int page_number) { return page_widths[page_number-1]; }
+ double get_page_height (int page_number) { return page_heights[page_number-1]; }
protected:
const Param * param;
double max_width, max_height;
+ map<int,int> page_widths, page_heights;
long long cur_font_id;
char * cur_code_map;
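
A note on the per-page bookkeeping in this patch: startPage stores sizes under page_widths[pageNum], while get_page_width reads page_widths[page_number-1], and the values are kept in a map<int,int> although getPageWidth() returns a double. Whether the two page numbers use the same base depends on the callers, which this diff does not fully show. The sketch below is one possible way to keep the key convention in a single place and preserve fractional sizes; PageSizeTable and its members are hypothetical names, not part of pdf2htmlEX, and the 1-based numbering is an assumption.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-in for the page_widths/page_heights members in the patch.
class PageSizeTable {
public:
    // page_number is assumed to be 1-based for both record() and lookup().
    void record(int page_number, double width, double height) {
        assert(page_number >= 1);
        sizes_[page_number] = Size{width, height};
    }

    // Reports a missing page instead of silently creating a zero-sized entry,
    // which is what std::map::operator[] does for an unknown key.
    bool lookup(int page_number, double& width, double& height) const {
        auto it = sizes_.find(page_number);
        if (it == sizes_.end()) return false;
        width = it->second.width;
        height = it->second.height;
        return true;
    }

private:
    struct Size { double width; double height; };
    std::map<int, Size> sizes_;  // doubles, so fractional sizes are not truncated
};
```

A caller would record each page once during preprocessing and treat a false return from lookup as an error, rather than dividing --fit-width by a width of zero.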