csvw.dsv fails on Python 2 with encodings that are not 8bit-clean #5

xflr6 · 2017-12-20T10:41:48Z

>>> from csvw.dsv import UnicodeWriter, UnicodeReader
>>> with UnicodeWriter(encoding='utf-16') as writer:
	writer.writerow([u'spam'])
>>> writer.f.seek(0)
>>> with UnicodeReader(writer.f, encoding='utf-16') as reader:
	print next(reader)
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf16' codec can't decode byte 0xb3 in position 2: truncated data
>>> writer.f.getvalue()
'\xff\xfes\r\n'

To allow arbitrary encodings with the Python 2 csv module, cells must be encoded into utf-8 before writing, so that the output of csv.writer can then be recoded into the wanted target encoding. On reading, the input must first be recoded into utf-8, and the cells coming from csv.reader are then decoded. Currently, only csvw.dsv.UnicodeReader does this recoding (as suggested in the csv docs); UnicodeWriter does not.
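A minimal sketch of the recode-via-utf-8 round trip described above. It is written for Python 3, where csv already works on text, so it only illustrates the byte-level recoding steps and is not the csvw implementation; the function names `write_rows_recoded` and `read_rows_recoded` are hypothetical.

```python
import csv
import io


def write_rows_recoded(rows, encoding):
    """Serialize rows as utf-8 CSV bytes first, then recode to `encoding`."""
    buf = io.StringIO(newline='')
    csv.writer(buf).writerows(rows)
    # Stand-in for what a Python 2 csv.writer would emit for utf-8 encoded cells.
    utf8_bytes = buf.getvalue().encode('utf-8')
    # Recode the complete csv output into the wanted target encoding.
    return utf8_bytes.decode('utf-8').encode(encoding)


def read_rows_recoded(data, encoding):
    """Recode `data` from `encoding` into utf-8, then parse and decode cells."""
    utf8_bytes = data.decode(encoding).encode('utf-8')
    # On Python 2 the csv.reader would consume utf8_bytes directly and each
    # cell would be decoded afterwards; here the text is decoded up front.
    text = utf8_bytes.decode('utf-8')
    return list(csv.reader(io.StringIO(text, newline='')))


data = write_rows_recoded([['spam', 'eggs']] * 2, 'utf-16')
rows = read_rows_recoded(data, 'utf-16')
```

The point of the detour is that the csv machinery only ever sees utf-8, in which the delimiter, quote and line terminator bytes are unambiguous, while the file on disk can use any target encoding.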

As an optimization, the recoding can be skipped for 8bit-clean encodings, cf. csvkit/agate:
https://github.com/wireservice/agate/blob/233afefbc7c0b25084666a2dd2b315b6359a128a/agate/csv_py2.py#L14-L17
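One way such a shortcut could look is a lookup against a small set of encodings that are known to be safe. This is only a sketch: the set below is an illustrative assumption, not the list used by agate or csvw, and `needs_recoding` is a hypothetical helper name.

```python
import codecs

# Illustrative assumption: normalized codec names treated as 8bit-clean.
EIGHT_BIT_CLEAN = {'ascii', 'utf-8', 'iso8859-1'}


def needs_recoding(encoding):
    """Return True if csv input/output should be recoded via utf-8."""
    # codecs.lookup() normalizes aliases such as 'UTF8' or 'latin-1'.
    return codecs.lookup(encoding).name not in EIGHT_BIT_CLEAN
```

With this check, 'utf-16' and 'utf-8-sig' would take the recoding path, while 'utf-8' and 'latin-1' would be passed through unchanged.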


xrotwang · 2017-12-20T11:06:13Z

Limiting the whole library to utf-8 would probably not be a good idea, I guess. Although recoding of files could easily be done elsewhere ...

xflr6 · 2017-12-20T11:07:33Z

+1, AFAIR some versions of Excel will produce/want utf-16 for Unicode.

xflr6 · 2018-01-18T09:00:40Z

Test case is here:

csvw/tests/test_dsv.py

Lines 93 to 107 in 84b4628

    
    @pytest.mark.parametrize('encoding', [
        pytest.param('utf-16', marks=pytest.mark.xfail(sys.version_info.major == 2, reason='FIXME: #5',
                                                       raises=UnicodeDecodeError)),
        pytest.param('utf-8-sig', marks=pytest.mark.xfail(sys.version_info.major == 2, reason='FIXME: #5')),
        'utf-8',
    ])
    def test_roundtrip_multibyte(tmpdir, encoding, row=['spam', 'eggs'], expected='spam,eggs\r\n', n=2):
        filepath = tmpdir / 'spam.csv'
        kwargs = {'encoding': encoding}
        with UnicodeWriter(str(filepath), **kwargs) as writer:
            writer.writerows([row] * n)
        with UnicodeReader(str(filepath), **kwargs) as reader:
            result = next(reader)
        assert result == row
        assert filepath.read_binary() == (expected * n).encode(encoding)


xflr6 added the bug label Dec 20, 2017

xrotwang added this to the csvw 1.0 milestone Dec 21, 2017

xflr6 self-assigned this Jan 8, 2018

xflr6 added a commit that referenced this issue Jan 18, 2018

add test case for dsv multibyte encoding issue #5 with xfail

8638094

xflr6 added a commit that referenced this issue Jan 18, 2018

add test case for dsv multibyte encoding issue #5 with xfail

ae07b13

xflr6 added a commit that referenced this issue Jan 18, 2018

add test case for dsv multibyte encoding issue #5 with xfail

84b4628

xflr6 added a commit that referenced this issue Jan 25, 2018

recode via utf-8 for non-8bit-clean encodings on PY2

c25c84e

- fix #5

xflr6 mentioned this issue Jan 25, 2018

recode via utf-8 for non-8bit-clean encodings on PY2 #10

Merged

xflr6 added a commit that referenced this issue Jan 26, 2018

recode via utf-8 for non-8bit-clean encodings on PY2

fa7e0dc

- fix #5

xflr6 added a commit that referenced this issue Jan 26, 2018

recode via utf-8 for non-8bit-clean encodings on PY2

17a7d30

- fix #5

xflr6 closed this as completed in #10 Jan 26, 2018
