#814 - open compressed files with binary mode #890

spazm · 2014-08-21T05:35:13Z

A possible fix for #814 (a regression possibly introduced with the fix for #765).

We need to open binary files with the 'b' flag (and not attempt to decode their contents as text) to avoid:

'utf8' codec can't decode byte 0x8b in position 1: invalid start byte
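
For reference, a minimal sketch that reproduces this class of error under Python 3, assuming only the standard library. A gzip stream begins with the magic bytes 0x1f 0x8b; the first byte is valid ASCII, but 0x8b cannot start a UTF-8 sequence, which is why decoding fails at position 1. The variable names are illustrative only and are not taken from the CLI code.

```python
import gzip

# Every gzip stream starts with the two magic bytes 0x1f 0x8b.
payload = gzip.compress(b"hello world")

error = None
try:
    # Treating compressed bytes as UTF-8 text is what text-mode reading does.
    payload.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

Reading the same bytes in binary mode skips the decode step entirely, so the failure cannot occur.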

>>> import mimetypes
>>> mimetypes.guess_type('foo.json')
('application/json', None)
>>> mimetypes.guess_type('foo.txt')
('text/plain', None)
>>> mimetypes.guess_type('foo.txt.gz')
('text/plain', 'gzip')
>>> mimetypes.guess_type('foo.txt.tar')
('application/x-tar', None)
>>> mimetypes.guess_type('foo.txt.tar.gz')
('application/x-tar', 'gzip')
>>> mimetypes.guess_type('foo.txt')
('text/plain', None)
>>> mimetypes.guess_type('foo.bar')
(None, None)

Summary of what mimetypes.guess_type returns:
.gz and .Z files: a non-None encoding
.tar: 'application/x-tar', with no encoding
.txt extensions: 'text/*' ctype
.json: 'application/json'
no extension or unknown extension: (None, None)

Open to suggestions here. We definitely want a non-None encoding to trigger binary treatment. I'm less sure about the other cases.

mimetypes.guess_type is already used in the s3 code.

coveralls · 2014-08-21T07:06:20Z

Coverage decreased (-0.05%) when pulling e1f20c0 on spazm:fix_814 into c560dfe on aws:develop.

spazm · 2014-08-21T07:33:20Z

Python 3 tests fail if .json files are opened in binary mode.
Narrowing the scope to use the binary flag only if an encoding is set (e.g. .gz, .Z, .bz2).
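
The narrowed rule can be sketched roughly as follows. This is an illustration, not the code in the pull request: `open_mode_for` is a hypothetical helper name, and the only library call assumed is the standard `mimetypes.guess_type`.

```python
import mimetypes


def open_mode_for(filename):
    """Hypothetical helper: pick an open() mode from the guessed encoding.

    Returns 'rb' when mimetypes reports a content encoding (gzip,
    compress, bzip2, ...), and 'r' otherwise.
    """
    _ctype, encoding = mimetypes.guess_type(filename)
    return "rb" if encoding is not None else "r"


for name in ("foo.json", "foo.txt", "foo.txt.gz", "foo.txt.tar", "foo.bar"):
    print(name, open_mode_for(name))
```

Note that under this rule an uncompressed .tar file is still opened in text mode, because guess_type reports no encoding for it.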

coveralls · 2014-08-21T07:34:32Z

Coverage decreased (-0.0%) when pulling 1d7090e on spazm:fix_814 into c560dfe on aws:develop.

coveralls · 2014-08-21T07:37:47Z

Coverage decreased (-0.0%) when pulling 1d7090e on spazm:fix_814 into c560dfe on aws:develop.

jamesls · 2014-08-21T17:58:08Z

What if we make this explicit and let the user tell us if the file should be opened in binary mode or not? Right now we support file://<filename>. What if we had something like fileb://<filename> or file+b://<filename>, etc.?
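
To make the idea concrete, here is a rough sketch of how an explicit prefix could select the mode. It is only an illustration of the proposal: `PREFIX_MODES` and `split_prefix` are hypothetical names, and nothing here describes how the CLI actually parses parameter values.

```python
# Hypothetical mapping from URI-style prefix to open() mode.
PREFIX_MODES = {
    "file://": "r",    # existing behavior: read as text
    "fileb://": "rb",  # proposed: read as raw bytes
}


def split_prefix(value):
    """Return (path, mode) if value starts with a known prefix, else None."""
    for prefix, mode in PREFIX_MODES.items():
        if value.startswith(prefix):
            return value[len(prefix):], mode
    return None


print(split_prefix("file://foo.json"))
print(split_prefix("fileb://foo.txt.gz"))
```

With an explicit prefix the user decides, so no guessing from the file extension is needed.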

spazm · 2014-08-21T21:15:34Z

@jamesls thanks for the response!

My quick thoughts:

It used to 'just work' for .gz files, so this is a regression.
Other tools (e.g. Java tools) support both text and compressed files without any extra flags.
If we had a flag, shouldn't it instead mark something as text that needs to be translated from the current encoding to UTF-8?

I like your files-as-blobs approach and, to a lesser degree, the --file-encoding approach, both of which you mentioned in a comment on #815.

Encodings can be such a pain to get right! I appreciate your work in trying to come up with a consistent plan.

jamesls · 2014-12-02T02:08:50Z

Fixed via @kyleknap's pull request #1010.

spazm force-pushed the fix_814 branch 3 times, most recently from 0b03970 to 40541c8 Compare August 21, 2014 05:43

aws#814 - open compressed files with binary mode

1d7090e

spazm force-pushed the fix_814 branch from f259a96 to 1d7090e Compare August 21, 2014 07:29

jamesls added the response-needed label Aug 21, 2014

jamesls mentioned this pull request Nov 18, 2014

s3api --sse-customer-key cannot accept binary data #815

Closed

jamesls closed this Dec 2, 2014

diehlaws added needs-response and removed response-needed labels Jan 4, 2019
