Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent behavior when dealing with alpha channels and tiffs #495

Closed
bmoscon opened this issue Sep 1, 2020 · 16 comments
Closed

Inconsistent behavior when dealing with alpha channels and tiffs #495

bmoscon opened this issue Sep 1, 2020 · 16 comments
Labels
doc upstream-issue Requires a functionality / behavior change outside of this python project
Milestone

Comments

@bmoscon
Copy link

bmoscon commented Sep 1, 2020

This is related to #468 so if you want to close this and re-open that, that's fine by me.

I'm using wand 0.6.2, and when I use the PDF still attached to #468 , and run that example code I posted, (I tried both setting alpha_channel to off and False) I get something far weirder (see attached). The original fix that was posted to stackoverflow also no longer works (but I think that's expected because you now handle it internally).

Some sample code:

This works:

with Image(filename='5-page-pdf.pdf') as img:
    img.type = "grayscale"
    img.format = "tiff"
    img.compression = "lzw"
    img.alpha_channel = 'off'
    img.save(filename="test.tiff")

This does not:

with Image(filename='5-page-pdf.pdf') as img:
    img.type = "grayscale"
    img.format = "tiff"
    img.compression = "lzw"
    img.alpha_channel = False
    img.save(filename="test.tiff")

I've noticed that some TIFF viewers enhance images more than others, the Preview application on macOS is pretty barebones and doesn't do anything to make the TIFF nicer, so the difference between the working and non working examples is that the non working example only has the first and last page with the alpha channel correctly set to off. (When Windows loads the TIFF using Windows Photo Viewer, it looks correct, but I believe its not showing the raw tiff). Not a major deal, I can just use "off", I just expected False to work based on the docs.

This also does not work, but the result is far weirder and likely a bug (TIFF output also attached):

blob = Image(filename='5-page-pdf.pdf').make_blob()

with Image(blob=blob) as img:
    img.type = "grayscale"
    img.format = "tiff"
    img.compression = "lzw"
    img.alpha_channel = 'off'
    img.save(filename="test.tiff")

As you can see, the alpha channel is removed, but each page is scaled down and repeated 3 times horizontally on its respective page. Very bizarre

The tiff showing the difference between "off" and False. (The version using off looks correct so I didnt bother posting it).

The tiff showing the result of loading an image as a blob and then trying to remove the alpha channel

@bmoscon
Copy link
Author

bmoscon commented Sep 1, 2020

@emcconville I'm happy to help try and debug this, but not sure where to start. My current hypothesis is that some information is lost either when converted to blob or when the Image object is instantiated from blob...

@emcconville
Copy link
Owner

Happy to have your help. I haven't had time to review the test cases, but "off" & False was aliased to to "deactivate" for ImageMagick-6, and in ImageMagick-7 "off" has it's own behavior with False still aliased to "deactivate".

@bmoscon
Copy link
Author

bmoscon commented Sep 1, 2020

ok, good to know. I was testing with imagemagick7, so let me repeat the tests with 6 to see what happens.

@bmoscon
Copy link
Author

bmoscon commented Sep 1, 2020

ok, using version 6, the difference between 'off' and False is gone, both work as expected. It seems it only differs using imageMagick7. So I guess that's solved, or maybe just needs a mention about behavior difference in the docs.

The 2nd issue, which I think is far more concerning, still persists while using version 6. For some reason opening, serializing to bytes, and then re-constituting from bytes causes issues.

@bmoscon
Copy link
Author

bmoscon commented Sep 1, 2020

Interestingly, I can simplify the test case to this to reproduce the 2nd issue (meaning its not alpha channel related):

blob = Image(filename='5-page-pdf.pdf').make_blob()

with Image(blob=blob) as img:
    img.format = "tiff"
    img.save(filename="test.tiff")

So it seems like its either related to to/from bytes, or from-PDF / to-TIFF ?

@emcconville
Copy link
Owner

I couple of thoughts & tips:

  1. What's the point of the blob = Image(filename='5-page-pdf.pdf').make_blob() line? ImageMagick will call ghostscript to generate a raster, and the .make_blob() embed the raster image back into a new PDF. Fine for a test case, but the resulting blob would be a low resolution + a generation loss. A basic with open('5-page-pdf.pdf', 'rb') as fd: blob = fd.read() would be faster & none destructive.
  2. With the Image constructor, be sure to pass format='PDF', and resolution=100 (or higher). The former ensures the right decoder is used as some blobs are ambiguous, and the latter as the default resolution is very low.
    with Image(blob=blob, format='PDF', resolution=100) as img:
        ....
  3. You can drop the img.format = "tiff" as ImageMagick has an order of operations as: (1) protocol (the tiff: prefix in the filename), (2) the filename extension, (3) the image format property, (4) the image info property (original image read format). In your example, the filename's suffix will be respected before any value set to the img.format property.

Just some tidbits until I can gather some time to review the test cases.

@bmoscon
Copy link
Author

bmoscon commented Sep 2, 2020

  1. Its just a simplified example that best represents my use case - which is files comings from object store in the cloud - the API gives me a byte stream. So, while I'm not really doing blob = Image(filename='5-page-pdf.pdf').make_blob() I am given raw bytes and writing raw bytes back. I write raw bytes with make_blob() (and then call the respective API to write them back to the storage bucket). Files on disk are not really worked with, so even the .save(filename....) is just sample code.
  2. Same goes for this. I don't have resolutions or format='PDF' because the image files I'm working with could be anything (PDF included). There are no file extensions on the blobs in the cloud either and I don't want to reformat any resolutions (unless I need to resample due to non-square pixels).
  3. Same deal here, I'm not actually writing files to disk, so my real use case is to set the image format to TIFF and then export to a bytestream for writing back to cloud storage.

Backtracking to 2, I'm guessing I can open a file with Wand to determine the image type, and then open it again passing in the format? Would this improve it at all? It seems like it wouldn't help because if I use Wand to determine the filetype, I wouldn't be able to supply any new information by passing in the format argument a 2nd time. If the format is especially important, is it worth it for me to do something to preprocess the blobs in some other way to determine the format and have that info in metadata somewhere for use when converting to TIFF?

@bmoscon
Copy link
Author

bmoscon commented Sep 2, 2020

also, tomorrow I'll test by opening the file with open and see if that affects the results at all - may help in narrowing down the root cause

@bmoscon
Copy link
Author

bmoscon commented Sep 2, 2020

my main issue seems most related to #195 now that I've looked into some older tickets

@bmoscon
Copy link
Author

bmoscon commented Sep 2, 2020

this

blob = open('5-page-pdf.pdf', 'rb').read()

with Image(blob=blob) as img:
    img.format = "tiff"
    img.save(filename="test.tiff")

works, which leads me to believe its related to make_blob(), because opening the file and saving it works fine (without specifying format or anything else). Not sure how save would differ greatly from make_blob - honestly I'd expect save to call make_blob for serialization and then save would just handle disk write

@emcconville
Copy link
Owner

Sounds right. This would hint at the generation loss by raster(ing) a PDF. The make_blob() for PDF's will generate a PDF of an image of a PDF.

Starting to get around to the original test cases. I can recreate the original off / False issues between ImageMagick versions with Wand & CLI by running the variations of ...

 cat 5-page-pdf.pdf | convert - -alpha off -compress lzw -type grayscale -format tiff test.tiff

For the previous #195 issue, that;s a known limitation with working blob or file descriptor. Wand just sits on top of the MagickWand library, and it'll inherit any shortcomings therefrom. Officially, users should specify the decoder when passing blobs.

is it worth it for me to do something to preprocess the blobs in some other way to determine the format

Kinda, but this is more in line with ImageTragick, and best practices. When working with blobs blindly, it's important to check the first few bytes to ensure the right decoder is used. For PDFs; that as simple as..

format = None
if blob[0:5] = b'%PDF-':
    format = 'PDF'
with Image(blob=blob, format=format) as img:
   ....

And of course, reject & error handle any unknown payloads.

@emcconville
Copy link
Owner

For the second issue. It appears to be introduced with blob = Image(filename='5-page-pdf.pdf).make_blob(). The resulting blob has the following error (per page).

   **** Error: ICCbased space /N value does not match the ICC profile.
                 Using the number of channels from the profile.
                 Output may be incorrect.

Which is coming from ghostscript.

@emcconville
Copy link
Owner

.. and can recreate in both wand & CLI directly

magick 5-page-pdf.pdf - | magick - -type grayscale -compress lzw -alpha off test.tiff

This tells me we don't have a Wand issue, but limitations / bug of ImageMagick / ghostscript.

FYI, I'm on fedora, and testing with ImageMagick 7.0.10-28 & 6.9.11-27. Both using ghostscript 9.52

@bmoscon
Copy link
Author

bmoscon commented Sep 2, 2020

@emcconville ok, seems like the only change for wand would be to, at most, note the difference between off and false for alpha channels between imagemagic6 and 7, but I can also see the case where one could argue that wand should document wand behavior, not imagemagick behavior. The other issues seem like imagemagick bugs(?). Is it worth my time to file them against imagemagick? You reproduced them with the CLI, not me, so if you'd like to file them be my guest, otherwise I'd probably just copy/paste your examples :D

I'm going to close this unless you think there is some reason it should remain open.

@emcconville
Copy link
Owner

I'll follow up with IM team, but need some more time to dig in. Go ahead and keep this issue open. I'll close it with the doc updates, and Wand 0.6.3 is ready to ship.

@emcconville
Copy link
Owner

emcconville commented Sep 15, 2020

Following up on this. Looks like the PDF files were the cause of the issue. Specifically the ImageType was set to grayscale, but the ICC profile was RGB. In previous version of GhostScript, the ICC was ignored. Current version of GhostScript will respect the ICC profile, and warn / error to stderr. Issue #485 talks about suppressing the error messages from delegates, but I believe in your use-case, you want to capture and parse for PDF messages.

Closing for now as this is not exactly a Wand issue (although capturing & releasing child process buffers would be a need feature).

@emcconville emcconville added this to the Wand 0.6.3 milestone Sep 15, 2020
@emcconville emcconville added doc upstream-issue Requires a functionality / behavior change outside of this python project labels Sep 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc upstream-issue Requires a functionality / behavior change outside of this python project
Projects
None yet
Development

No branches or pull requests

2 participants