raw and parsed decode compatibility with ZXing #17

bondeje · 2021-08-16T02:18:55Z

This is probably related to issue #6 and tangentially related to issue #8; it could also simply be closed with a statement on the intended use case and spec compatibility of this wrapper with ZXing.

With ZXing, the default output of the raw stream is hex bytes, while the parsed/text output appears to default to a plain Java String (UTF-16). However, the python-zxing wrapper has these two lines in __init__.py/Barcode:

python-zxing/zxing/__init__.py

Lines 143 to 144 in 938ce86

    
    raw = raw[:-1].decode()
    parsed = parsed[:-1].decode()

which force UTF-8 compatibility of the raw and parsed streams. This causes UnicodeDecodeError exceptions with any byte sequences that begin with bytes "larger" than b"\x7f" with which ZXing doesn't appear to have any problems. Just commenting out '.decode()' in each of these lines, I am able to reproduce ZXing results for byte data in both raw and parsed.
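A minimal sketch of the failure, using only the standard library: Python's `bytes.decode()` defaults to UTF-8, and a lone byte above 0x7f is not a valid UTF-8 sequence. The variable names are illustrative and are not taken from the wrapper.

```python
# A lone 0x80 byte is a UTF-8 continuation byte with no lead byte,
# so the default (UTF-8) decode rejects it.
payload = b"\x80"

try:
    payload.decode()  # same as payload.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print(f"decode failed: {exc.reason}")

# Pure-ASCII payloads decode without trouble.
ascii_text = b"hello".decode()
```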

My question: is there a reason the '.decode()' or UTF-8 is even necessary for either raw or parsed or is the use case for this wrapper intended to be for UTF-8 compatible QR codes only?


dlenski · 2021-08-16T18:31:44Z

which force UTF-8 compatibility of the raw and parsed streams. This causes UnicodeDecodeError exceptions with any byte sequences that begin with bytes "larger" than b"\x7f" with which ZXing doesn't appear to have any problems. Just commenting out '.decode()' in each of these lines, I am able to reproduce ZXing results for byte data in both raw and parsed.

Can you give a specific example of a barcode where the ZXing Java library produces a non-UTF8-compatible raw and/or parsed output, and how you think that barcode should be interpreted?

My question: is there a reason the '.decode()' or UTF-8 is even necessary for either raw or parsed or is the use case for this wrapper intended to be for UTF-8 compatible QR codes only?

It would be nice to make it work for as many cases as possible. I'm a bit leery about changing the current/existing behavior (return a Python str) to returning bytes instead, however.
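One possible way to keep returning a Python str while still exposing undecoded data is to attempt the UTF-8 decode and keep the original bytes alongside it. The sketch below is hypothetical: `decode_if_utf8` and its return shape are invented for illustration and are not part of the wrapper's API. It also only covers the Python side and does nothing about what the ZXing process itself writes.

```python
from typing import Optional, Tuple


def decode_if_utf8(data: bytes) -> Tuple[Optional[str], bytes]:
    """Return (text, data); text is None when data is not valid UTF-8."""
    try:
        return data.decode("utf-8"), data
    except UnicodeDecodeError:
        return None, data


# Valid UTF-8 input yields both a str and the original bytes.
text, raw_bytes = decode_if_utf8(b"caf\xc3\xa9")

# Invalid UTF-8 input yields no str, but the bytes are preserved.
no_text, kept_bytes = decode_if_utf8(b"\x80")
```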

bondeje · 2021-08-19T06:15:03Z

Below is the QR code for the binary packed integer 128: b'\x80'.

For the current UTF-8 decoding, i.e.

python-zxing/zxing/__init__.py

Lines 143 to 144 in 938ce86

    
    raw = raw[:-1].decode()
    parsed = parsed[:-1].decode()

it results in:

But removing the UTF-8 decoding, i.e.

raw = raw[:-1]#.decode()
parsed = parsed[:-1]#.decode()

results in what I am expecting:

I emphasize "I am expecting" because I'm not sure whether "raw" here is intended to be just the "raw text" or the corresponding portion of the "raw bytes", as seen in the https://zxing.org/w/decode output for the same QR code. It seems that the ZXing CommandLineRunner is able to return the "raw bytes" when subprocess is run with no encoding specified (None), as it currently is. I understand if this is not supported: getting all three of the "raw text", part of the "raw bytes", and the "parsed result" from the subprocess (whose only options are all bytes or all strings in a specified encoding) would probably require ugly try/except logic and guessing which decoding/parsing is intended.
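The "None encoding" behavior mentioned here is standard `subprocess` behavior: when no `encoding` or `text` argument is given, captured output comes back as `bytes`. A small sketch, using a child Python process as a stand-in for the ZXing runner (the child command is purely illustrative):

```python
import subprocess
import sys

# Child process that writes the single byte 0x80 to stdout, standing in
# for any program that emits non-UTF-8 output.
child = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'\\x80')"]

# No encoding/text argument: stdout is returned as raw bytes.
result = subprocess.run(child, stdout=subprocess.PIPE, check=True)
captured = result.stdout
```

Whether the bytes that arrive this way are still the barcode's original bytes depends entirely on what the child process wrote.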

I am not doing anything special with reading the QR code:

import zxing

if __name__ == '__main__':
    img_file = './python-zxing_tests/Int_128_packed_binary.png' # the attached QR code
    print(zxing.BarCodeReader().decode(img_file))

dlenski · 2021-08-25T17:23:53Z

Below is the QR code for the binary packed integer 128: b'\x80'. https://user-images.githubusercontent.com/88994019/130013442-60545f8d-432d-491d-b3e4-37891cde13cc.png

The QR code symbology does not provide any mechanism to encode "binary" barcodes. The only well-defined use of it is to encode strings in various character encodings.

The default interpretation of the bytes in a QR code (same for PDF417, Aztec code, DataMatrix) is ISO-8859-1; any barcode that intends another character encoding must declare it using ECI codes in order to be interpreted correctly.

(See https://stackoverflow.com/questions/27857718/aztec-barcode-vs-qr-code/64585749 for some work I've done in this area.)
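A property of ISO-8859-1 that matters for this discussion: Python's built-in `latin-1` codec maps every byte value 0x00 to 0xFF to the code point of the same number, so decoding never fails and re-encoding restores the original bytes. A quick check, with illustrative variable names:

```python
all_bytes = bytes(range(256))

# latin-1 maps byte N to code point N, so this never raises.
as_text = all_bytes.decode("latin-1")
round_tripped = as_text.encode("latin-1")

# The same byte string is not valid UTF-8.
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```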

So:

There's no well-defined way to signal that a QR code should be interpreted as "binary". ISO-8859-1 sorta-kinda works as a binary fallback because it has no multi-byte character sequences, so tolerant encoders/decoders can reversibly decode bytes as ISO-8859-1 and then re-encode them, including unknown bytes.
The ZXing command-line-runner mangles the output of raw bytes beyond recognition on some operating systems, if they can't be correctly interpreted as UTF-8. For example, the QR code you give as an example gets completely borked on Linux. (I'm guessing you're testing on Windows?) See aff3dde where I added your file as an example, along with some other possible changes:

======================================================================
FAIL: test_all.test_decoding('QR_CODE-binary-80.png', 'QR_CODE', b'\x80')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/dlenski/floss/python-zxing/test/test_all.py", line 50, in _check_decoding
    raise AssertionError('Expected {!r} but got {!r}'.format(expected_raw, dec.raw_bytes))
AssertionError: Expected b'\x80' but got b'\xef\xbf\xbd'

In order to improve this situation, the ZXing command-line runner would have to be improved to not mangle unknown bytes.
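The `b'\xef\xbf\xbd'` in the failure above is the UTF-8 encoding of U+FFFD, the Unicode replacement character. The sketch below reproduces the same three bytes in Python by decoding with replacement and re-encoding. That the ZXing runner does the equivalent on the Java side is an assumption consistent with the test output, not something confirmed here.

```python
original = b"\x80"

# Invalid input is replaced with U+FFFD instead of raising.
lossy_text = original.decode("utf-8", errors="replace")

# Re-encoding yields the replacement character's UTF-8 form,
# not the original byte, so the original value is unrecoverable.
mangled = lossy_text.encode("utf-8")
```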

My question: is there a reason the '.decode()' or UTF-8 is even necessary for either raw or parsed or is the use case for this wrapper intended to be for UTF-8 compatible QR codes only?

I guess the "tl;dr" here is that there's no way to use the ZXing command-line runner, in its current form, to decode barcodes whose contents don't map to a sequence of UTF-8 characters.

bondeje · 2021-08-25T19:38:44Z

Yes, my tests were on Windows.

Thanks for the detailed response. I appreciate all the information.

dlenski · 2021-08-25T20:25:52Z

Yes, my tests were on Windows.

Good to know, thanks! Interesting that ZXing CLI mangles the output less on Windows. It'd be good to get a stable encoding-proof output method upstream.

I lack the bandwidth to work on this now, but would be happy to review an upstream PR if you want to contribute it. :-D

dlenski added a commit that referenced this issue Aug 25, 2021

mangled 'binary' barcode (see #17)

aff3dde

bondeje closed this as completed Aug 25, 2021

dlenski mentioned this issue Aug 30, 2021

Use jpype for wrapper? #19

Closed

dlenski mentioned this issue Nov 12, 2021

how to get rawByte from qr code? #20

Closed

dlenski mentioned this issue Feb 7, 2022

utf-8 codec decoding error #22

Open

dlenski added a commit that referenced this issue Feb 7, 2022

mangled 'binary' barcode (see #17)

4dc8648

Repository owner locked and limited conversation to collaborators Aug 24, 2022
