Rotate / flip PDFs? #554

bjeanes · 2021-01-08T23:44:13Z

I have a bunch of documents which are upside down (but which I didn't realise until Docspell slurped them in and previewed them).

Have you thought yet about adding the option to rotate a PDF? Eventually, you may even be able to use some heuristics to propose flips/rotations automatically (e.g. if you can't OCR much text but can after rotating, or by training a model to detect upside-down text, etc.).
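
To make the OCR-based heuristic a bit more concrete, here is a rough sketch. It is only an illustration and not something Docspell does: `ocr_at` is a hypothetical callable that returns the OCR text of the page rotated by the given angle, and `known_words` is an assumed small vocabulary for the document language.

```python
from typing import Callable, Iterable, Set


def count_known_words(text: str, known_words: Set[str]) -> int:
    """Count tokens that appear in the vocabulary (case-insensitive)."""
    tokens = (t.strip(".,;:!?\"'()").lower() for t in text.split())
    return sum(1 for t in tokens if t in known_words)


def propose_rotation(
    ocr_at: Callable[[int], str],
    known_words: Set[str],
    angles: Iterable[int] = (0, 90, 180, 270),
) -> int:
    """Return the angle whose OCR output contains the most known words.

    On a tie, the first angle in `angles` wins, so 0 is preferred.
    """
    return max(angles, key=lambda a: count_known_words(ocr_at(a), known_words))


# Canned OCR results stand in for a real OCR engine (assumption).
fake_ocr = {
    0: "JU3JUO2 UJIM Po qaM",
    180: "A cozy corner of the Web filled with content",
}
vocabulary = {"a", "cozy", "corner", "of", "the", "web", "filled", "with", "content"}
best = propose_rotation(lambda angle: fake_ocr.get(angle, ""), vocabulary)
print(best)
```

Counting dictionary hits is just one possible score; an OCR engine's own confidence value, if it exposes one, might work as well.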

eikek · 2021-01-09T00:36:29Z

Thanks for the suggestion! Yes, I thought about it and it is something I would like to add. But tbh it is not at the top of the list currently, which may change if people want this more than other things. I would like to be able to remove blank pages and rotate the pdf pages manually. This was one reason for converting everything to pdf, so it can be manipulated later without changing the original. And whatever can be done automatically, should be :-) I would also think that rotation should be possible to automate.

eresturo · 2021-01-21T22:39:14Z

Another idea could be automatic page trimming.

Or automatic blank page detection: I experimented with different methods in my scanner script. In the end I just used a simple threshold on the standard deviation of the image pixels: https://github.com/eresturo/scanadf2docspell/blob/5aa3d05c3669c4715db3b24400226d9db42d1c4f/src/preprocessor.py#L29
It works quite well on my documents. The threshold could be configurable, and empty pages could be just "hidden" instead of removed.
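
A minimal sketch of the standard-deviation idea, for illustration only. It does not reproduce the linked script: the function name `is_blank_page`, the flat list of grayscale values and the default threshold of 5.0 are all assumptions.

```python
from statistics import pstdev
from typing import Sequence


def is_blank_page(gray_pixels: Sequence[int], threshold: float = 5.0) -> bool:
    """Treat a page as blank when its grayscale values barely vary.

    `gray_pixels` is a flat sequence of 0-255 values; `threshold` is an
    arbitrary example value and would need tuning per scanner.
    """
    if not gray_pixels:
        return True  # nothing on the page at all
    return pstdev(gray_pixels) < threshold


# An almost uniform light page versus a page that is half black, half white.
nearly_white = [250, 252, 251, 249] * 100
half_black = [255] * 500 + [0] * 500
print(is_blank_page(nearly_white), is_blank_page(half_black))
```

A blank page has almost no contrast, so its pixel values cluster tightly and the standard deviation stays small, while text or graphics push it up.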

bjeanes · 2021-01-21T22:53:36Z

Yeah that would be nice too. Fortunately my scanner does empty page removal for me so that isn't something I thought about.

eikek · 2021-01-22T00:16:06Z

Yes, that would be nice indeed. A step after pdf conversion could do all this. What I think would be nice too, is to be able to split pdfs based on some stamp or sign that indicates the last page (or a separator page).
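
Roughly, the splitting step could look like the sketch below. This is an assumption-laden illustration, not existing Docspell code: `is_separator` is a hypothetical predicate (for example one that looks for a stamp or a known separator text on a page), and pages are represented by plain strings here.

```python
from typing import Callable, List, TypeVar

Page = TypeVar("Page")


def split_on_separators(
    pages: List[Page], is_separator: Callable[[Page], bool]
) -> List[List[Page]]:
    """Group pages into documents, dropping the separator pages themselves."""
    documents: List[List[Page]] = []
    current: List[Page] = []
    for page in pages:
        if is_separator(page):
            if current:  # ignore leading or repeated separators
                documents.append(current)
            current = []
        else:
            current.append(page)
    if current:  # pages after the last separator form the final document
        documents.append(current)
    return documents


# Page texts stand in for real PDF pages in this example.
scanned = ["invoice p1", "invoice p2", "SEPARATOR", "letter p1", "SEPARATOR", "receipt"]
parts = split_on_separators(scanned, lambda text: text == "SEPARATOR")
print(parts)
```

If a stamp marks the last page of a document instead, the marked page would be appended to `current` before starting a new group rather than dropped.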

vakilando · 2021-01-22T00:21:39Z

"split pdfs based on some stamp or sign that indicates the last page (or a separator page)."

this is a really good idea, would be cool!

eikek · 2022-05-24T21:21:39Z

Just a small update: parts of this could now be achieved by using this addon. It is still a feature I would like to have "first class" in docspell, but until this comes the addon is an alternative that can be used right now.

dariuszszyc · 2022-08-15T16:07:22Z

Either I'm doing something wrong or this alternative doesn't solve the problem.
My issue is that my original document is in an incorrect rotation, hence OCR couldn't really understand the text.

I did use the addon you mentioned, however it rotates only the processed/result pdf, not the original.
When I use re-processing (to get the correct OCR text after rotating it properly), it still uses the incorrectly rotated original.

What I'd like to achieve is rotating the original, then re-processing it.

eikek · 2022-08-15T18:51:20Z

@dariuszszyc hm, the addon should also overwrite the extracted text in docspell so that you can use fulltext search etc. Does this not work (without an additional reprocess)? The original file will never be touched, though. But the "converted" file should be rotated and the extracted text should be updated as well.

dariuszszyc · 2022-08-15T19:09:58Z

@eikek it didn't work for me. I did a few more tests, and here are the results:

1. First I uploaded the original document (a jpg file with incorrect rotation) - OCR couldn't recognize the text properly.
2. I used the rotate addon - the converted PDF got rotated, but the extracted text didn't change.
3. Also, I made a copy of the jpg file, rotated it with the Windows Photos app (then, to be sure, I checked with Paint - it was rotated properly) and uploaded it. The result was that the "original" jpg file was rotated properly, but the converted PDF was rotated incorrectly (as it was originally in point 1).
4. However, when I took the properly-rotated jpg file from point 3, opened it in Paint, added a single dot anywhere and uploaded it - then both the original file and the converted PDF were rotated properly (the rotation wasn't changed as it was in point 3) and OCR recognized the text properly.

eikek · 2022-08-16T20:41:16Z

Thank you for these details, @dariuszszyc. I think point 2 is a bug then; I need to look into it.

Point 3 and 4: When using JPG, it is often the case that the orientation is stored as metadata (kind of) and viewers will either interpret it or not. Some tools won't really rotate the image, but change the orientation setting only. When you edit the image data somehow (when you added a single dot), then the tool is required to store it anew. Could you maybe send me some example jpg file so I can reproduce this?
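
To illustrate what "changing the orientation setting only" means: the EXIF orientation tag is a small number stored next to unchanged pixel data, and a viewer that honours it rotates the pixels at display time. The sketch below "bakes" the tag into the pixels for the three pure-rotation values. It is a toy model on nested lists, not Docspell's or any image library's code; the function name is made up, and the mirrored values 2, 4, 5 and 7 are deliberately left out.

```python
from typing import List

Grid = List[List[int]]


def apply_exif_orientation(pixels: Grid, orientation: int) -> Grid:
    """Return the pixel grid as a viewer honouring the EXIF tag would show it."""
    if orientation == 1:  # stored upright, nothing to do
        return [list(row) for row in pixels]
    if orientation == 3:  # display rotated by 180 degrees
        return [list(row[::-1]) for row in pixels[::-1]]
    if orientation == 6:  # display rotated 90 degrees clockwise
        return [list(row) for row in zip(*pixels[::-1])]
    if orientation == 8:  # display rotated 90 degrees counter-clockwise
        return [list(row) for row in zip(*pixels)][::-1]
    raise ValueError(f"orientation {orientation} is not handled in this sketch")


# The stored data never changes; only the interpretation of the tag does.
stored = [[1, 2],
          [3, 4]]
shown = apply_exif_orientation(stored, 6)
print(shown)
```

A tool that ignores the tag shows `stored` as-is, which would match a correctly displayed jpg ending up as a wrongly rotated PDF.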

eikek · 2022-08-16T20:42:56Z

Also, maybe we can use a new ticket for this problem here - I just created one, docspell/rotate-pdf-addon#1, copying your notes.

dariuszszyc · 2022-08-16T21:31:46Z

Forgive my stupid question, I'm not an advanced user: how can I share the jpg with you so that it's available only to you (and not visible here)?

eikek · 2022-08-16T23:27:18Z

No worries! You can send me an e-mail or message me privately on Matrix (see readme) - ofc if you can just create a new file with some sample content, then you could also post it here.

dariuszszyc · 2022-08-17T07:23:21Z

Sample files:

1. Original (incorrect rotation)
2. Rotated with Windows Photos app (probably changed orientation only in metadata)
3. Screenshot of a properly-rotated file (point 2) - it's basically a new file, not just changed metadata.

Results from Docspell below.
Please keep in mind my OCR is set to Polish, therefore you might see some Polish characters in the extracted content.

Original (incorrect rotation)
Extracted content

"7noś Jojsnf
JU3JUO2 UJIM Po qaM 9U1JO JOUJ02 Az0I Y

uU9Qq 9UL

Processed PDF (picture of it)

Rotated with Windows Photos app (probably changed orientation only in metadata)
All apps display this picture in a correct orientation, but in docspell I see the original one (doesn't follow the metadata orientation?).

Extracted content

"7noś Jojsnf
JU3JUO2 UJIM Po qaM 9U1JO JOUJ02 Az0I Y

uU9Qq 9UL

Processed PDF (picture of it)

Screenshot of a properly-rotated file (point 2) - it's basically a new file - not just changed metadata.
Extracted content

The Den

A cozy corner ofthe Web filled with content
justforyou.

Processed PDF (picture of it)

eikek · 2022-08-18T11:04:44Z

Thank you @dariuszszyc - I'll test with these.

madduck · 2023-09-08T22:04:58Z

Linking #1437

eikek added the joex label (affects the joex component) on Jan 22, 2021

eikek mentioned this issue on Jul 3, 2021: Feature Request: "Try to get the orientation right" #912 (open)

eikek mentioned this issue on Aug 16, 2022: After rotation, text is not extracted docspell/rotate-pdf-addon#1 (open)
