raw and parsed decode compatibility with ZXing #17

Closed
bondeje opened this issue Aug 16, 2021 · 5 comments

bondeje commented Aug 16, 2021

This is probably related to issue #6 and tangentially related to issue #8; it might also just be closed with a statement on the use case/spec compatibility of this wrapper with ZXing.

With ZXing, the raw stream is output as hex bytes by default, while the parsed text appears to default to a plain Java String (UTF-16). However, the python-zxing wrapper has these two lines in __init__.py/Barcode:

raw = raw[:-1].decode()
parsed = parsed[:-1].decode()

which force UTF-8 compatibility of the raw and parsed streams. This causes UnicodeDecodeError exceptions with any byte sequences that begin with bytes "larger" than b"\x7f" with which ZXing doesn't appear to have any problems. Just commenting out '.decode()' in each of these lines, I am able to reproduce ZXing results for byte data in both raw and parsed.

My question: is there a reason the '.decode()' or UTF-8 is even necessary for either raw or parsed or is the use case for this wrapper intended to be for UTF-8 compatible QR codes only?
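
To make the failure mode concrete, here is a minimal sketch of what the two `.decode()` lines do. The byte strings and the `strict_decode` helper are hypothetical stand-ins, not the wrapper's actual code. It also suggests the error comes from bytes that are not valid UTF-8, rather than from high bytes as such, since a well-formed multi-byte sequence decodes fine.

```python
# Hypothetical stand-ins for lines read from the ZXing subprocess; each ends
# with the newline that the wrapper strips via [:-1].
ascii_line = b"hello\n"
utf8_line = b"\xc3\xa9\n"   # valid two-byte UTF-8 sequence for U+00E9
binary_line = b"\x80\n"     # a lone continuation byte, not valid UTF-8


def strict_decode(line: bytes) -> str:
    """Mirror `line[:-1].decode()`: strip the newline, decode as UTF-8."""
    return line[:-1].decode()


print(strict_decode(ascii_line))
print(strict_decode(utf8_line))

try:
    strict_decode(binary_line)
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print("decode failed:", exc.reason)
```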

Owner

dlenski commented Aug 16, 2021

> which force UTF-8 compatibility of the raw and parsed streams. This causes UnicodeDecodeError exceptions with any byte sequences that begin with bytes "larger" than b"\x7f" with which ZXing doesn't appear to have any problems. Just commenting out '.decode()' in each of these lines, I am able to reproduce ZXing results for byte data in both raw and parsed.

Can you give a specific example of a barcode where the ZXing Java library produces a non-UTF8-compatible raw and/or parsed output, and how you think that barcode should be interpreted?

> My question: is there a reason the '.decode()' or UTF-8 is even necessary for either raw or parsed or is the use case for this wrapper intended to be for UTF-8 compatible QR codes only?

It would be nice to make it work for as many cases as possible. I'm a bit leery about changing the current/existing behavior (return a Python str) to returning bytes instead, however.

Author

bondeje commented Aug 19, 2021

Below is the QR code for the binary packed integer 128: b'\x80'.
[image: Int_128_packed_binary]

For the current UTF-8 decoding, i.e.

raw = raw[:-1].decode()
parsed = parsed[:-1].decode()

it results in:
[image: screenshot of the output]

But removing the UTF-8 decoding, i.e.

raw = raw[:-1]#.decode()
parsed = parsed[:-1]#.decode() 

results in what I am expecting:
[image: screenshot of the output]

I emphasize "what I am expecting" because I'm not sure whether "raw" here is intended to be just the "raw text" or the corresponding portion of the "raw bytes" shown in the https://zxing.org/w/decode output for the same QR code. It seems that the ZXing CommandLineRunner is able to return the "raw bytes" when subprocess is run without an encoding (encoding=None), as it currently is. I understand if this is not supported: getting all three of the "raw text", part of the "raw bytes", and the "parsed result" from the subprocess (whose output is either all bytes or all strings in one specified encoding) would probably require ugly try/except logic and guessing at whatever decoding/parsing is intended.
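
For reference, a small sketch of the subprocess behavior mentioned above: when no encoding is given, the child's output comes back as undecoded bytes. The child process here is a hypothetical stand-in that just writes one byte; it is not the ZXing runner.

```python
import subprocess
import sys

# Hypothetical child process standing in for the ZXing command-line runner:
# it writes the single byte 0x80 plus a newline to its standard output.
child_code = "import sys; sys.stdout.buffer.write(b'\\x80\\n')"

# No `encoding` or `text` argument is passed, so stdout is returned as bytes
# and nothing is decoded on the Python side.
result = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    check=True,
)
raw_output = result.stdout
print(type(raw_output), raw_output)
```

This only shows that Python leaves the bytes alone; whatever the child process itself does to the data before writing it is outside Python's control.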

[image]

I am not doing anything special with reading the QR code:

import zxing

if __name__ == '__main__':
    img_file = './python-zxing_tests/Int_128_packed_binary.png' # the attached QR code
    print(zxing.BarCodeReader().decode(img_file))

dlenski added a commit that referenced this issue Aug 25, 2021
Owner

dlenski commented Aug 25, 2021

> Below is the QR code for the binary packed integer 128: b'\x80'. https://user-images.githubusercontent.com/88994019/130013442-60545f8d-432d-491d-b3e4-37891cde13cc.png

The QR code symbology does not provide any mechanism to encode "binary" barcodes. The only well-defined use of it is to encode strings in various character encodings.

The default interpretation of the bytes in a QR code (same for PDF417, Aztec code, DataMatrix) is ISO-8859-1; any barcode that intends another character encoding must declare it using ECI codes in order to be interpreted correctly.

(See https://stackoverflow.com/questions/27857718/aztec-barcode-vs-qr-code/64585749 for some work I've done in this area.)
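
As a quick illustration of why ISO-8859-1 can act as a byte-preserving interpretation, the sketch below uses Python's own `iso-8859-1` codec (not ZXing) to show that every byte value maps to exactly one character and back again.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# ISO-8859-1 is a single-byte encoding: each byte becomes the code point
# with the same numeric value, so decoding never fails in Python's codec.
as_text = all_bytes.decode("iso-8859-1")

# Re-encoding recovers the original bytes exactly.
round_tripped = as_text.encode("iso-8859-1")

print(len(as_text))
print(round_tripped == all_bytes)
```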

So:

  1. There's no well-defined way to signal that a QR code should be interpreted as "binary". ISO-8859-1 sorta-kinda works as a binary fallback: it has no multi-byte character sequences, so tolerant encoders/decoders can reversibly decode bytes to ISO-8859-1 and then re-encode them, including unknown bytes.
  2. The ZXing command-line-runner mangles the output of raw bytes beyond recognition on some operating systems, if they can't be correctly interpreted as UTF-8. For example, the QR code you give as an example gets completely borked on Linux. (I'm guessing you're testing on Windows?) See aff3dde where I added your file as an example, along with some other possible changes:
======================================================================
FAIL: test_all.test_decoding('QR_CODE-binary-80.png', 'QR_CODE', b'\x80')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/dlenski/floss/python-zxing/test/test_all.py", line 50, in _check_decoding
    raise AssertionError('Expected {!r} but got {!r}'.format(expected_raw, dec.raw_bytes))
AssertionError: Expected b'\x80' but got b'\xef\xbf\xbd'
  3. To improve this situation, the ZXing command-line runner would have to be changed so that it does not mangle unknown bytes.
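
The `b'\xef\xbf\xbd'` in the test failure above is the UTF-8 encoding of U+FFFD, the Unicode replacement character. The sketch below reproduces that byte sequence in pure Python. Using `errors="replace"` to model what happens inside the runner is an assumption for illustration, not a confirmed description of its internals.

```python
original = b"\x80"

# Assumed model of the lossy path: the undecodable byte is replaced with
# U+FFFD while decoding as UTF-8, and the result is then written out as UTF-8.
mangled = original.decode("utf-8", errors="replace").encode("utf-8")
print(mangled)

# A different invalid byte ends up as the same output, so the original
# value cannot be recovered afterwards.
other_mangled = b"\x81".decode("utf-8", errors="replace").encode("utf-8")
print(other_mangled == mangled)
```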

> My question: is there a reason the '.decode()' or UTF-8 is even necessary for either raw or parsed or is the use case for this wrapper intended to be for UTF-8 compatible QR codes only?

I guess the "tl;dr" here is that there's no way to use the ZXing command-line runner, in its current form, to decode barcodes whose contents don't map to a sequence of UTF-8 characters.

Author

bondeje commented Aug 25, 2021

Yes, my tests were on Windows.

Thanks for the detailed response. I appreciate all the information.

@bondeje bondeje closed this as completed Aug 25, 2021
Owner

dlenski commented Aug 25, 2021

> Yes, my tests were on Windows.

Good to know, thanks! Interesting that ZXing CLI mangles the output less on Windows. It'd be good to get a stable encoding-proof output method upstream.

I lack the bandwidth to work on this now, but would be happy to review an upstream PR if you want to contribute it. :-D

dlenski added a commit that referenced this issue Feb 7, 2022
Repository owner locked and limited conversation to collaborators Aug 24, 2022