Decoding issue with utf-8 #71

Closed
tranvictor opened this issue Aug 9, 2016 · 13 comments

Comments

@tranvictor

As far as I can see in dropbox/dropbox.py (line 393 and following), you always decode the response as UTF-8, which I don't think works properly when the user has a non-UTF-8 file name.

Should we use chardet to detect the encoding of the response before decoding it, to reduce errors?

@cakoose

cakoose commented Aug 9, 2016

  • File names in Dropbox only use Unicode characters.
  • UTF-8 can represent all Unicode characters.
  • Dropbox API JSON responses are always valid UTF-8.

Are you running into any specific issues?

@tranvictor
Author

Yes, I ran into exactly the issue I described above. This is the first time I have played with the Dropbox API, so I ran this snippet:

import dropbox

dbx = dropbox.Dropbox('hidden')
print dbx.users_get_current_account()
for entry in dbx.files_list_folder('', recursive = True).entries:
    print entry.name

Then I just got:

Traceback (most recent call last):
  File "restore.py", line 5, in <module>
    for entry in dbx.files_list_folder('', recursive = True).entries:
  File "/Library/Python/2.7/site-packages/dropbox/base.py", line 538, in files_list_folder
    None,
  File "/Library/Python/2.7/site-packages/dropbox/dropbox.py", line 214, in request
    request_binary)
  File "/Library/Python/2.7/site-packages/dropbox/dropbox.py", line 297, in request_json_string_with_retry
    request_binary)
  File "/Library/Python/2.7/site-packages/dropbox/dropbox.py", line 393, in request_json_string
    raw_resp = r.content.decode('utf-8')
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte

In my Dropbox, I have a directory named "Tổ dân phố", and I think this string caused the issue. Are you sure that file names in Dropbox are all in UTF-8? What if the user used a different encoding standard? Could the name still be Unicode but stored under a different encoding?

@cakoose

cakoose commented Aug 9, 2016

It looks like the response you're getting is not valid UTF-8. That should never happen.

Can you add the following line to "dropbox/dropbox.py", right before line 393:

print "r.content =", repr(r.content)[:100]

About Unicode/UTF-8

All Dropbox file name characters are definitely a subset of the Unicode character set. It doesn't actually matter what encoding Dropbox uses to store the file names; when the Dropbox API returns a JSON response, it will always encode things using UTF-8. Since UTF-8 can losslessly represent any Unicode string, this should work.
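
To illustrate the point about lossless encoding, here is a minimal sketch in Python 3 (not the Python 2.7 used elsewhere in this thread). It round-trips the directory name from the report through UTF-8, and shows that a JSON `\u` escape like the one in the account response decodes to the same character. The variable names are only for illustration.

```python
import json

# "Tổ dân phố", written with escapes so the source file's own encoding
# cannot affect the result.
name = "T\u1ed5 d\u00e2n ph\u1ed1"

# Encode to UTF-8 bytes and decode back: nothing is lost.
encoded = name.encode("utf-8")
round_trip = encoded.decode("utf-8")

# JSON may also carry non-ASCII characters as \uXXXX escapes,
# which decode to the same Unicode text.
given_name = json.loads('"S\\u00fan"')

print(round_trip == name)
print(given_name == "S\u00fan")
```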

@tranvictor
Author

This is what I got from that print:

r.content = '\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xbd\xd8Ys\xaaH\x14\x07\xf0\xafBy_\xaf\x99nv|\xf3*j\\\x12\x

@cakoose

cakoose commented Aug 10, 2016

This is very strange.

@braincore: Is there a way to turn on request/response logging?

@greg-db
Contributor

greg-db commented Aug 10, 2016

@cakoose This might do the trick, depending on what exactly you want logged:

import httplib
httplib.HTTPConnection.debuglevel = 1

That will have it print the request and response, except for the response body.
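
For anyone on Python 3, where `httplib` was renamed to `http.client`, a rough equivalent might look like the sketch below. The `enable_http_debug` helper is a hypothetical name used only for this example; it is not part of the SDK or of Requests.

```python
import http.client


def enable_http_debug(level=1):
    """Make http.client print request lines and response headers to stdout."""
    # Setting the class attribute affects every connection created afterwards,
    # including HTTPS connections, which subclass HTTPConnection.
    http.client.HTTPConnection.debuglevel = level


enable_http_debug()
```

As with the Python 2 version, this is a process-wide switch, so it is best kept to debugging sessions.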

@cakoose

cakoose commented Aug 10, 2016

That's exactly what I was looking for, thanks Greg.

@tranvictor: Could you add the debuglevel = 1 thing and test again?

@tranvictor
Author

This is what I got:

send: 'POST /2/users/get_current_account HTTP/1.1\r\nHost: api.dropboxapi.com\r\nContent-Length: 4\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nUser-Agent: OfficialDropboxPythonSDKv2/6.6.0\r\nConnection: keep-alive\r\nContent-Type: application/json\r\nAuthorization: Bearer <Intentionally hidden>\r\n\r\nnull'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx
header: Date: Wed, 10 Aug 2016 06:46:17 GMT
header: Content-Type: application/json
header: Content-Length: 383
header: Connection: keep-alive
header: Cache-Control: no-cache
header: Pragma: no-cache
header: Set-Cookie: gvc=Nzk0MjE2NDg3NTM5MDMzMzAyNzkxNTM1MzI1MTM0OTU3OTY5OTQ%3D; expires=Mon, 09 Aug 2021 06:46:17 GMT; httponly; Path=/; secure
header: X-Content-Type-Options: nosniff
header: X-Dropbox-Http-Protocol: None
header: X-Dropbox-Request-Id: d3bb892415535da43bba605bdc609db9
header: X-Frame-Options: SAMEORIGIN
header: X-Server-Response-Time: 98
r.content = '{"account_id": "dbid:AADOXO4y9eyhAV0pugeh3skHWTC9V935eqE", "name": {"given_name": "S\\u00fan", "sur
FullAccount(account_id=u'dbid:AADOXO4y9eyhAV0pugeh3skHWTC9V935eqE', name=Name(given_name=u'S\xfan', surname=u'Bom', familiar_name=u'S\xfan', display_name=u'S\xfan Bom'), email=u'thanhlich2111@gmail.com', email_verified=True, disabled=False, locale=u'en', referral_link=u'https://db.tt/ol5R3F1x', is_paired=False, account_type=AccountType(u'basic', None), profile_photo_url=None, country=u'VN', team=None, team_member_id=None)
send: 'POST /2/files/list_folder HTTP/1.1\r\nHost: api.dropboxapi.com\r\nContent-Length: 133\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nUser-Agent: OfficialDropboxPythonSDKv2/6.6.0\r\nConnection: keep-alive\r\nCookie: gvc=<intentionally hidden>%3D\r\nContent-Type: application/json\r\nAuthorization: Bearer <intentionally hidden>\r\n\r\n{"path": "", "recursive": false, "include_media_info": false, "include_deleted": false, "include_has_explicit_shared_members": false}'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx
header: Date: Wed, 10 Aug 2016 06:46:17 GMT
header: Content-Type: application/json
header: Transfer-Encoding: chunked
header: Connection: keep-alive
header: Cache-Control: no-cache
header: Pragma: no-cache
header: X-Content-Type-Options: nosniff
header: X-Dropbox-Http-Protocol: None
header: X-Dropbox-Request-Id: 192ccf4ea924845863db4d37924af1b6
header: X-Frame-Options: SAMEORIGIN
header: X-Server-Response-Time: 154
header: Content-Encoding: gzip
r.content = '\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xbd\xd7Y\x8f\xa2J\x14\x07\xf0\xafb|\x1e\xe6R\xec\xf8f#\xca
Traceback (most recent call last):
  File "restore.py", line 7, in <module>
    for entry in dbx.files_list_folder('', recursive = False).entries:
  File "/Library/Python/2.7/site-packages/dropbox/base.py", line 538, in files_list_folder
    None,
  File "/Library/Python/2.7/site-packages/dropbox/dropbox.py", line 217, in request
    request_binary)
  File "/Library/Python/2.7/site-packages/dropbox/dropbox.py", line 300, in request_json_string_with_retry
    request_binary)
  File "/Library/Python/2.7/site-packages/dropbox/dropbox.py", line 397, in request_json_string
    raw_resp = r.content.decode('utf-8')
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte

@cakoose

cakoose commented Aug 10, 2016

The server is sending back gzip-encoded content with the Content-Encoding: gzip response header. The HTTP library we use (Requests) should automatically un-gzip the data before putting it in r.content. Unfortunately, something is going wrong, because r.content looks like gzipped content.
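
The leading bytes `\x1f\x8b` in r.content are the gzip magic number, which is why the body looks gzipped. The sketch below (Python 3) reproduces the same UnicodeDecodeError on a locally compressed payload and shows how the body decodes once it is un-gzipped. `decode_body` is a hypothetical helper for illustration only, not something the SDK or Requests provides, and the payload is made up.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(body):
    """Return the text of a response body, un-gzipping it first if needed."""
    if body[:2] == GZIP_MAGIC:
        body = gzip.decompress(body)
    return body.decode("utf-8")


# Made-up JSON payload standing in for a real API response.
payload = '{"name": "T\u1ed5 d\u00e2n ph\u1ed1"}'
compressed = gzip.compress(payload.encode("utf-8"))

# Decoding the still-compressed bytes fails on the second magic byte.
error = None
try:
    compressed.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc

print(error)
print(decode_body(compressed) == payload)
```

This only illustrates the symptom. Checking for the magic number in application code would be a workaround, not a fix for whatever stopped the HTTP library from decompressing the body.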

Can you try running this script:

import requests
print requests.__version__

import httplib
httplib.HTTPConnection.debuglevel = 1

r = requests.get('https://httpbin.org/gzip')
print "r.content =", repr(r.content)

@tranvictor
Author

2.6.1
send: 'GET /gzip HTTP/1.1\r\nHost: httpbin.org\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nUser-Agent: python-requests/2.6.1 CPython/2.7.11 Darwin/15.6.0\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx
header: Date: Wed, 10 Aug 2016 07:32:06 GMT
header: Content-Type: application/json
header: Content-Length: 198
header: Connection: keep-alive
header: Content-Encoding: gzip
header: Access-Control-Allow-Origin: *
header: Access-Control-Allow-Credentials: true
r.content = '{\n  "gzipped": true, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.6.1 CPython/2.7.11 Darwin/15.6.0"\n  }, \n  "method": "GET", \n  "origin": "118.68.118.189"\n}\n'

Yeah, I know that requests should handle the gzip transparently.

@cakoose

cakoose commented Aug 10, 2016

Looks like it's a bug in Requests 2.6.1. The release notes for 2.6.2 say:

Fix regression where compressed data that was sent as chunked data was not properly decompressed. (#2561)

I was able to reproduce the issue myself. Upgrading to a newer version of Requests fixed it.
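
For anyone who wants to check whether an installed Requests already includes that fix, a minimal sketch follows. The helper names are hypothetical, and the parser assumes a plain dotted numeric version string such as the `2.6.1` printed above.

```python
FIXED_IN = (2, 6, 2)  # release whose notes mention the chunked + gzip fix


def parse_version(text):
    """Turn '2.6.1' into (2, 6, 1); assumes only dotted integers."""
    return tuple(int(part) for part in text.split("."))


def has_chunked_gzip_fix(version_text):
    """True if the given Requests version is 2.6.2 or newer."""
    return parse_version(version_text) >= FIXED_IN


print(has_chunked_gzip_fix("2.6.1"))  # the version reported in this thread
print(has_chunked_gzip_fix("2.6.2"))
```

In practice the argument would be `requests.__version__`.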

@cakoose cakoose closed this as completed Aug 10, 2016
@tranvictor
Author

Ah, thank you. It's fixed for me too after I upgraded to a newer version of Requests.

@braincore
Contributor

Good one @cakoose! Updated requests dependency to v2.6.2 in 434768d.
