Rotate / flip PDFs? #554

Open
bjeanes opened this issue Jan 8, 2021 · 16 comments
Labels
joex (affects the joex component)

Comments

@bjeanes
Contributor

bjeanes commented Jan 8, 2021

I have a bunch of documents which are upside down (but which I didn't realise until Docspell slurped them in and previewed them).

Have you thought yet about adding the option to rotate a PDF? Eventually, you may even be able to use some heuristics to propose flips/rotations automatically (e.g. if you can't OCR much text but can after rotating, or by training a model to detect upside-down text, etc.)
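
The OCR-based heuristic suggested above could look roughly like the sketch below. It is only an illustration, not anything Docspell does today: `propose_rotation`, `score_text`, the `ocr_at` callable and the tiny vocabulary are all hypothetical names, and a real setup would call an actual OCR engine instead of the canned strings used here.

```python
from typing import Callable, Iterable, Set


def score_text(text: str, vocabulary: Set[str]) -> int:
    """Count whitespace-separated tokens that are known words."""
    tokens = (token.strip(".,!?\"'") for token in text.lower().split())
    return sum(1 for token in tokens if token in vocabulary)


def propose_rotation(
    ocr_at: Callable[[int], str],
    vocabulary: Set[str],
    angles: Iterable[int] = (0, 90, 180, 270),
) -> int:
    """Return the angle whose OCR output contains the most known words.

    `ocr_at(angle)` is assumed to rotate the page by `angle` degrees and
    return the recognized text. On a tie the earliest angle wins, so an
    unreadable page is left unrotated.
    """
    return max(angles, key=lambda angle: score_text(ocr_at(angle), vocabulary))


# Canned OCR results standing in for a real OCR call (assumption).
FAKE_OCR = {
    0: "JU3JUO2 UJIM Po qaM 9U1JO",
    90: "",
    180: "A cozy corner of the Web filled with content",
    270: "",
}
VOCABULARY = {"a", "cozy", "corner", "of", "the", "web", "filled", "with", "content"}

best_angle = propose_rotation(FAKE_OCR.get, VOCABULARY)
```

Running OCR four times per page is expensive, so this would probably only make sense as a fallback when the first pass yields very little text.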

@eikek
Owner

eikek commented Jan 9, 2021

Thanks for the suggestion! Yes, I thought about it and it is something I would like to add. But tbh it is not at the top of the list currently, which may change if people want this more than other things. I would like to be able to remove blank pages and rotate the pdf pages manually. This was one reason for converting everything to pdf, so it can be manipulated later without changing the original. And whatever can be done automatically, should be :-) I would also think that rotation should be possible to automate.

@eresturo

Another idea could be automatic page trimming.

Or automatic blank page detection: I experimented with different methods in my scanner script. In the end I just used a simple threshold on the standard deviation of the image pixels: https://github.com/eresturo/scanadf2docspell/blob/5aa3d05c3669c4715db3b24400226d9db42d1c4f/src/preprocessor.py#L29
It works quite well on my documents. The threshold could be configurable, and empty pages could just be "hidden" instead of removed.
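
A minimal sketch of that standard-deviation idea, using only the Python standard library. This is not the code from the linked script: the function name `is_blank_page`, the flat list of grayscale values as input, and the default threshold of 5.0 are all assumptions made for illustration.

```python
from statistics import pstdev


def is_blank_page(pixels, threshold=5.0):
    """Guess whether a page is blank from its grayscale pixel values.

    `pixels` is a flat iterable of intensities (0 = black, 255 = white).
    A page with almost no ink has nearly uniform pixels, so its standard
    deviation is small. The threshold is a made-up default that would
    need tuning against real scans.
    """
    values = list(pixels)
    if not values:
        # No pixel data at all: treat as blank.
        return True
    return pstdev(values) < threshold


almost_white = [250, 252, 251, 253] * 25   # slight scanner noise only
with_text = [255, 0] * 50                  # strong black/white contrast

blank_flags = (is_blank_page(almost_white), is_blank_page(with_text))
```

Making `threshold` a parameter matches the suggestion that it should be configurable, and returning a flag rather than deleting anything leaves room for only hiding the page.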

@bjeanes
Contributor Author

bjeanes commented Jan 21, 2021

Yeah that would be nice too. Fortunately my scanner does empty page removal for me so that isn't something I thought about.

@eikek
Owner

eikek commented Jan 22, 2021

Yes, that would be nice indeed. A step after pdf conversion could do all this. What I think would also be nice is being able to split pdfs based on some stamp or sign that indicates the last page (or a separator page).
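
The separator-page variant of that splitting idea can be sketched as follows. It assumes the hard part, recognizing a separator page, is already solved and passed in as `is_separator`; both that callable and `split_on_separator` are hypothetical names, and pages are plain strings here purely for demonstration.

```python
def split_on_separator(pages, is_separator):
    """Group consecutive pages into documents, cutting at separator pages.

    Separator pages themselves are dropped, and empty groups (for example
    two separators in a row) are skipped.
    """
    documents = []
    current = []
    for page in pages:
        if is_separator(page):
            if current:
                documents.append(current)
            current = []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents


scanned = ["a1", "a2", "SEP", "b1", "SEP", "SEP", "c1"]
documents = split_on_separator(scanned, lambda page: page == "SEP")
```

A "stamp on the last page" rule would differ slightly, since the marked page belongs to the document and would have to be kept before cutting.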

@vakilando

This is a really good idea, would be cool!

> split pdfs based on some stamp or sign that indicates the last page (or a separator page).

@eikek added the joex label Jan 22, 2021
@eikek
Owner

eikek commented May 24, 2022

Just a small update: parts of this could now be achieved by using this addon. It is still a feature I would like to have "first class" in docspell, but until this comes the addon is an alternative that can be used right now.

@dariuszszyc

Either I am doing something wrong or this alternative doesn't solve the problem.
My issue is that my original document is in an incorrect rotation, hence OCR couldn't really understand the text.

I did use the addon you mentioned, however it rotates only the processed/result pdf, not the original.
When I use re-processing (to get the correct OCR text after rotating it properly), it still uses the incorrectly rotated original.

What I'd like to achieve is rotating the original, then re-processing it.

@eikek
Owner

eikek commented Aug 15, 2022

@dariuszszyc hm, the addon should also overwrite the extracted text in docspell so that you can use fulltext search etc. Does this not work (without an additional reprocess)? The original file will never be touched, though. But the "converted" file should be rotated and the extracted text should be updated as well.

@dariuszszyc

@eikek it didn't work for me. I did a few more tests and here are the results:

  1. First I uploaded the original document (jpg file with incorrect rotation) - OCR couldn't recognize the text properly.

  2. I used the rotate addon - the converted PDF got rotated, but the extracted text didn't change.

  3. Also, I made a copy of the jpg file, rotated it with the Windows Photos app (then, to be sure, I checked with Paint - it was rotated properly) and uploaded it. The result was that the "original" jpg file is rotated properly, but the converted PDF is rotated incorrectly (as it was originally in point 1).

  4. However, when I took the properly-rotated jpg file from point 3, opened it in Paint, added a single dot anywhere and uploaded it, both the original file and the converted PDF were rotated properly (the rotation wasn't changed as happened in point 3) and OCR recognized the text properly.

@eikek
Owner

eikek commented Aug 16, 2022

Thank you for these details, @dariuszszyc. I think point 2 is a bug then, I need to look into it.

Point 3 and 4: When using JPG, it is often the case that the orientation is stored as metadata (kind of) and viewers will either interpret it or not. Some tools won't really rotate the image, but change the orientation setting only. When you edit the image data somehow (when you added a single dot), then the tool is required to store it anew. Could you maybe send me some example jpg file so I can reproduce this?
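
To make the orientation point concrete: JPEG files commonly carry an EXIF Orientation tag with a value from 1 to 8, and a viewer that honours it applies the matching flip or rotation at display time while the stored pixels stay as they were. The sketch below applies those eight cases to a tiny image represented as a list of rows. It is a plain-Python illustration with made-up function names, not what Docspell or the addon actually runs, and whether the conversion step reads this tag at all is not established in this thread.

```python
def flip_horizontal(rows):
    return [list(reversed(row)) for row in rows]


def flip_vertical(rows):
    return [list(row) for row in reversed(rows)]


def transpose(rows):
    # Mirror across the top-left to bottom-right diagonal.
    return [list(col) for col in zip(*rows)]


def rotate_180(rows):
    return flip_vertical(flip_horizontal(rows))


def rotate_90_cw(rows):
    return [list(col) for col in zip(*reversed(rows))]


def rotate_90_ccw(rows):
    return [list(col) for col in reversed(list(zip(*rows)))]


def transverse(rows):
    # Mirror across the top-right to bottom-left diagonal.
    return rotate_180(transpose(rows))


# EXIF Orientation value -> transform needed to display the stored pixels upright.
TRANSFORMS = {
    1: lambda rows: [list(row) for row in rows],  # already upright
    2: flip_horizontal,
    3: rotate_180,
    4: flip_vertical,
    5: transpose,
    6: rotate_90_cw,
    7: transverse,
    8: rotate_90_ccw,
}


def apply_orientation(rows, orientation):
    """Return upright pixel rows; unknown tag values are treated as 1."""
    return TRANSFORMS.get(orientation, TRANSFORMS[1])(rows)


stored = [[1, 2],
          [3, 4]]
upright = apply_orientation(stored, 6)
```

A tool that only rewrites the tag leaves `stored` unchanged, which would explain why software that ignores the tag still sees the old rotation, while re-saving edited pixel data bakes the rotation in.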

@eikek
Owner

eikek commented Aug 16, 2022

Also, maybe we can use a new ticket for this problem - I just created one, docspell/rotate-pdf-addon#1, copying your notes.

@dariuszszyc

> Thank you for these details, @dariuszszyc. I think point 2 is a bug then, I need to look into it.
>
> Point 3 and 4: When using JPG, it is often the case that the orientation is stored as metadata (kind of) and viewers will either interpret it or not. Some tools won't really rotate the image, but change the orientation setting only. When you edit the image data somehow (when you added a single dot), then the tool is required to store it anew. Could you maybe send me some example jpg file so I can reproduce this?

Forgive my stupid question - I'm not an advanced user - how can I share the jpg with you so it's available only to you (and not visible here)?

@eikek
Owner

eikek commented Aug 16, 2022

> Forgive my stupid question - I'm not an advanced user - how can I share the jpg with you so it's available only to you (and not visible here)?

No worries! You can send me an e-mail or chat with me privately on Matrix (see readme). Of course, if you can just create a new file with some sample content, then you could also post it here.

@dariuszszyc

> > Forgive my stupid question - I'm not an advanced user - how can I share the jpg with you so it's available only to you (and not visible here)?
>
> No worries! You can send me an e-mail or chat with me privately on Matrix (see readme). Of course, if you can just create a new file with some sample content, then you could also post it here.

Sample files:

  1. Original (incorrect rotation)
    [image: 1 orig]

  2. Rotated with Windows Photos app (probably changed orientation only in metadata)
    [image: 2 rotated]

  3. Screenshot of a properly-rotated file (point 2) - it's basically a new file - not just changed metadata.
    [image: 3 properly rotated]

Results from Docspell below.
Please keep in mind my OCR is set to Polish, therefore you might see some Polish characters in the extracted content.

  1. Original (incorrect rotation)

Extracted content

"7noś Jojsnf
JU3JUO2 UJIM Po qaM 9U1JO JOUJ02 Az0I Y

uU9Qq 9UL

Processed PDF (picture of it)
[image: 1 orig_processed]

  2. Rotated with Windows Photos app (probably changed orientation only in metadata)
    All apps display this picture in the correct orientation, but in Docspell I see the original one (does it not follow the metadata orientation?).

Extracted content

"7noś Jojsnf
JU3JUO2 UJIM Po qaM 9U1JO JOUJ02 Az0I Y

uU9Qq 9UL

Processed PDF (picture of it)
[image: 2 rotated_processed]

  3. Screenshot of a properly-rotated file (point 2) - it's basically a new file - not just changed metadata.

Extracted content

The Den

A cozy corner ofthe Web filled with content
justforyou.

Processed PDF (picture of it)
[image: 3 properly rotated_processed]

@eikek
Owner

eikek commented Aug 18, 2022

Thank you @dariuszszyc - I'll test with these.

@madduck
Contributor

madduck commented Sep 8, 2023

Linking #1437
