jpeg vs progressive jpeg #4

Open
6 tasks
ngawangtrinley opened this issue Feb 25, 2020 · 16 comments
Comments

@ngawangtrinley
Collaborator

ngawangtrinley commented Feb 25, 2020

@eroux and @TBRC-JimK, the scanners we use in China don't have Zip compression for Tiff, while LZW is only available for 8 bits.

I believe we want to scan in Tiff 24 bits rather than 8 bits (16 is a no-no). Do we agree?

  • Yay
  • Nay

TIFF 24 bits presents the following compression options, which one do we prefer:

  • None # too big for practical handling on the ground
  • Packbits # seems to be a retired algorithm
  • JPEG # lossy, but do we care?
  • Progressive JPEG # lossy too, is the progressive part of any use?

We did analyse the pros and cons of various options and J2K seemed to be the best option at the time, check the documentation here.

@eroux
Contributor

eroux commented Feb 25, 2020

24bit total = 8 bit per channel. Is the LZW compression available for it? (it's not super clear in your comment)

For remarkable artefacts, I still think we could use 48bit = 16 bit per channel, but that can be considered optional. 24 bit per channel is quite excessive.

If LZW and Zip are not available, then none. I suppose lossless jpeg2000 is not unreasonable if there's no other option...

@ngawangtrinley
Collaborator Author

ngawangtrinley commented Feb 25, 2020

8 bits and 24 bits are two different options, so that must be 24 per channel. 16 isn't on the menu. LZW is not available for 24 bits.

@eroux
Contributor

eroux commented Feb 25, 2020

ok, I suppose 8bit = one channel and 24 bit = three channels then. What are the options available outside of tiff?

(also, to answer one of your questions, progressive is better for large images but it's really a detail)
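
The bit-depth arithmetic in this exchange can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the default of 400 DPI is taken from the resolution quoted later in the thread.

```python
def bits_per_channel(total_bits: int, channels: int = 3) -> int:
    """Split a total bit depth evenly across channels (24 bits / 3 = 8)."""
    if total_bits % channels:
        raise ValueError("bit depth is not evenly divisible by channel count")
    return total_bits // channels


def uncompressed_bytes(width_in: float, height_in: float,
                       dpi: int = 400, bits_per_pixel: int = 24) -> int:
    """Raw pixel payload of an uncompressed scan, ignoring TIFF headers."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    print(bits_per_channel(24))  # three-channel color, 8 bits each
    print(bits_per_channel(48))  # the optional 16 bits per channel case
    # Hypothetical A4-sized page (8.27 x 11.69 inches), an assumption for scale.
    print(uncompressed_bytes(8.27, 11.69) / 1e6, "MB")
```

For a hypothetical A4-sized page this comes to roughly 46 MB per image, which gives a sense of why "None" was described as too big for practical handling on the ground.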

@ngawangtrinley
Collaborator Author

ngawangtrinley commented Feb 25, 2020

@jeehuajian will post some screenshots of the settings menus tomorrow. Progressive JPEG files are nearly half the size of baseline JPEG files and visually look clearer, so if there isn't any issue with progressive we might want to go for it.

@eroux
Contributor

eroux commented Feb 25, 2020

ok thanks! half the size raises eyebrows... it's supposed to be only slightly smaller... what's the problem with producing uncompressed tiffs that can then be zip-compressed with xnview?

@jimk-bdrc

jimk-bdrc commented Feb 25, 2020 via email

This is a scoping question which may be too late: If you are using scanners, does this mean you are scanning printed material? If so, why not retain the old standard of binary TIFF with LZW compression for the vast majority of the pages, and keep archival quality TIFF for the front and back material, and any other color illustrations? In a 100 page book, it really doesn’t matter if 5 or 10 pages are uncompressed.

@eroux
Contributor

eroux commented Feb 25, 2020

I agree, for generally black and white stuff, gray tiff in lzw is the best option. I think we're missing too much information to give a reasonable answer... what problem are we trying to solve? what's the context? what are the limitations of the users, of their machines, of the software, etc.?

@ngawangtrinley
Collaborator Author

Scanners are used for both modern prints and pechas. They are used for everything as long as the paper isn't cardboard style. The staff on the ground has cameras but prefers Fujitsu scanners by far. For black and white material they used to scan straight to G4, which we then replaced by j2k as a single lossless color format for all archive images. Web images were then derived into G4 or JPEG depending on the content.

The problem we're trying to solve now is deciding what we replace j2k with. We could go back to 3 formats/compression for color, grayscale and BW. This matters since the scanner and software tutorials will cover the scanning-time settings. The constraints are simplicity, file size, and processing time.

@eroux
Contributor

eroux commented Feb 25, 2020

So they discriminate between three cases (color, gray and bitonal), that's interesting... is lzw available in tiff for bitonal or just G4? Would something like j2k for color and lzw for gray and bitonal be simple enough?

@Drongbulobsang
Collaborator

Drongbulobsang commented Feb 27, 2020

@jeehuajian will post some screenshots of the settings menus tomorrow. Progressive JPEG files are nearly half the size of baseline JPEG files and visually look clearer, so if there isn't any issue with progressive we might want to go for it.

Screenshots of the scanner settings

@eroux
Contributor

eroux commented Feb 27, 2020

just to be sure, can you send me a j2k that the scanner produces? I want to check if it's lossless or if they encode it in a lossy way... thanks!
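
One quick first check of this kind can be done without decoding the image, by reading the wavelet transform declared in the JPEG 2000 main header. The sketch below is an assumption about how such a check could look, not what was actually run here; the function name is hypothetical. It walks the main header up to the COD marker segment and reads the transformation byte (1 = reversible 5/3, 0 = irreversible 9/7). A reversible transform is necessary for lossless encoding but does not prove it, since the encoder may still have discarded data, so a pixel-by-pixel comparison against the source remains the definitive test.

```python
import struct
from typing import Optional

SOC_SIZ = b"\xff\x4f\xff\x51"  # start-of-codestream marker followed by SIZ
COD = 0xFF52                   # coding style default marker
SOT = 0xFF90                   # start of tile-part: end of the main header


def wavelet_transform(data: bytes) -> Optional[str]:
    """Return the wavelet transform declared in the main header, if found."""
    start = data.find(SOC_SIZ)  # also locates the codestream inside a .jp2 file
    if start < 0:
        return None
    pos = start + 2  # skip SOC, which carries no length field
    while pos + 4 <= len(data):
        marker, length = struct.unpack(">HH", data[pos:pos + 4])
        if marker == SOT or length < 2:
            return None
        if marker == COD:
            # After the marker: Lcod (2), Scod (1), SGcod (4), then SPcod,
            # whose fifth byte is the wavelet transformation.
            idx = pos + 13
            if idx >= len(data):
                return None
            return "reversible 5/3" if data[idx] == 1 else "irreversible 9/7"
        pos += 2 + length  # the length field counts itself but not the marker
    return None
```

Usage would be along the lines of `wavelet_transform(open("scan.jp2", "rb").read())`, where `scan.jp2` stands for whatever file the scanner produced.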

@ngawangtrinley
Collaborator Author

ngawangtrinley commented Mar 5, 2020

After a few days of intense testing, here's what decided the final winner:
https://github.com/buda-base/digitization-guidelines/wiki/J2K-vs-Tiff-no-compression

The final decision for images produced with Fujitsu scanners is:

  • Unique scan-time format:
    • Tiff, 400 DPI, 24 bits, compression none
  • Unique Archive derivatives:
    • Tiff, 400 DPI, 24 bits, compression ZIP
  • Color content Web derivatives:
    • Progressive JPEG, resized, 24 bits, compression 60
  • BW content Web derivatives:
    • Tiff, resized, binary, compression G4
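
As a rough illustration of how the three derivatives above could be produced, the sketch below builds ImageMagick command lines in Python without running them. ImageMagick is a swapped-in tool here (the thread itself mentions XnView), and several details are assumptions: "compression 60" is read as JPEG quality 60, the 50% threshold for the bitonal conversion is arbitrary, and the output file naming is invented for the example.

```python
from typing import List, Tuple


def derivative_commands(src: str, stem: str, percent: int,
                        bitonal: bool) -> Tuple[List[str], List[str]]:
    """Build (archive, web) ImageMagick argument lists for one scan."""
    # Archive derivative: same pixels, lossless ZIP (Deflate) compression.
    archive = ["magick", src, "-compress", "Zip", f"{stem}_archive.tif"]
    if bitonal:
        # BW web derivative: resize, reduce to 1 bit, CCITT Group 4.
        web = ["magick", src, "-resize", f"{percent}%",
               "-threshold", "50%",  # assumed threshold, tune per collection
               "-type", "Bilevel", "-compress", "Group4",
               f"{stem}_web.tif"]
    else:
        # Color web derivative: resize, progressive JPEG at quality 60.
        web = ["magick", src, "-resize", f"{percent}%",
               "-interlace", "Plane", "-quality", "60",
               f"{stem}_web.jpg"]
    return archive, web


if __name__ == "__main__":
    for cmd in derivative_commands("image123.tif", "image123", 60, bitonal=False):
        print(" ".join(cmd))
```

The lists can be passed to `subprocess.run(cmd, check=True)` once the settings have been validated on a few sample pages.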

Resizing is based on the height of the ། (for OCR, the minimum character height is 20 pixels and the optimum is 40 pixels):

  • Normal content (common/modern prints):
    • 40 pixels །
  • Special content (manuscripts etc):
    • 60 pixels །

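། heights are measured on the archive images, and they determine the resizing ratio (target height divided by measured height):

[image: ། height measured on an archive image]
Here the measured height is 100 pixels, so reaching the 60-pixel target means a resizing ratio of 60%.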
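
The ratio calculation can be written down as a small helper. This is a minimal sketch with a hypothetical function name; capping the ratio at 100% so that images are never enlarged is an assumption, not something stated in the guidelines.

```python
def resize_ratio(measured_px: float, target_px: float) -> float:
    """Ratio that brings a ། of measured_px height down to target_px."""
    if measured_px <= 0 or target_px <= 0:
        raise ValueError("heights must be positive")
    # Assumption: never upscale, so the ratio is capped at 1.0 (100%).
    return min(1.0, target_px / measured_px)


if __name__ == "__main__":
    print(resize_ratio(100, 60))  # special content, 60-pixel target
    print(resize_ratio(100, 40))  # normal content, 40-pixel target
```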

@ngawangtrinley
Collaborator Author

ngawangtrinley commented Mar 5, 2020 via email

An easy way to make a derivation script would be to ask field staff to add a suffix to images that need to be converted to color, something like image123x.tif for image123.tif. With this and a command line interface script that takes in the source images path + the ། height in pixels we would be good to go.

(I know, we should have figured this out 10 years ago)
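
A minimal sketch of what such a command line script could look like, under several assumptions: the trailing `x` before the extension marks color images as in the example file name above, unmarked images default to the BW route, the target ། height defaults to 40 pixels, and the names `plan`, `main` and `--target` are invented for illustration. It only prints the planned conversions and does not touch any image.

```python
import argparse
from pathlib import Path
from typing import List, Optional, Tuple


def plan(source_dir: str, char_height_px: float,
         target_px: float = 40) -> List[Tuple[str, str, int]]:
    """List (file name, derivative kind, resize percent) for each .tif."""
    percent = min(100, round(100 * target_px / char_height_px))
    jobs = []
    for tif in sorted(Path(source_dir).glob("*.tif")):
        # Assumed convention: image123x.tif is flagged for a color derivative.
        kind = "color-jpeg" if tif.stem.endswith("x") else "bw-g4"
        jobs.append((tif.name, kind, percent))
    return jobs


def main(argv: Optional[List[str]] = None) -> None:
    parser = argparse.ArgumentParser(description="Plan web derivatives.")
    parser.add_argument("source", help="folder containing the archive TIFFs")
    parser.add_argument("--height", type=float, required=True,
                        help="measured height of the ། in pixels")
    parser.add_argument("--target", type=float, default=40,
                        help="target height of the ། in pixels")
    args = parser.parse_args(argv)
    for name, kind, percent in plan(args.source, args.height, args.target):
        print(f"{name}\t{kind}\t{percent}%")
```

Calling `main(["scans/", "--height", "100"])`, or `main()` from an `if __name__ == "__main__":` guard in a standalone script, would list each image with its planned derivative type and resize percentage.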
