Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FileObject with mode "r" reads bytes instead of string #1441

Closed
wiggin15 opened this issue Jul 28, 2019 · 5 comments
Closed

FileObject with mode "r" reads bytes instead of string #1441

wiggin15 opened this issue Jul 28, 2019 · 5 comments
Milestone

Comments

@wiggin15
Copy link
Contributor

@wiggin15 wiggin15 commented Jul 28, 2019

  • gevent version: 1.4.0
  • Python version: 3.5.1
  • Operating System: macOS 10.14.5

Description:

I am using gevent.fileobject.FileObject with mode "r" (read, not binary). In Python 3, I expect this mode to return str objects when reading the file content (only passing binary mode will read bytes). However, I am getting bytes and not str with this mode.

With regular Python file object:

>>> fd = open("foo.json", "r")
>>> type(fd.read())
<class 'str'>

With gevent.fileobject.FileObject:

>>> from gevent.fileobject import FileObject
>>> fd = open("/tmp/bla.json", "r")
>>> fo = FileObject(fd, mode="r")
>>> type(fo.read())
<class 'bytes'>

This makes it impossible to use FileObject with json.load (which reads the content by itself, not giving me a chance to decode the content).

@wiggin15

This comment has been minimized.

Copy link
Contributor Author

@wiggin15 wiggin15 commented Jul 28, 2019

The same goes for writing. Opening with mode "w" should expect writing strings but it expects bytes:

>>> from gevent.fileobject import FileObject
>>> fd = open("foo.json", "w")
>>> fo = FileObject(fd, mode="w")
>>> fo.write("bar")
Traceback (most recent call last):
  File "<console>", line 1, in <module>
TypeError: a bytes-like object is required, not 'str'
@wiggin15

This comment has been minimized.

Copy link
Contributor Author

@wiggin15 wiggin15 commented Jul 30, 2019

When going over the code (trying to create a PR), I also found some other bugs with this object. The 'U' mode causes some unexpected results in some cases:

  1. On Python 2, using 'U' causes the stream to be wrapped with TextIOWrapper, which uses unicode objects. gevent handles reading by re-encoding read data, but does not handle writes. This causes the following bug:
from gevent.fileobject import FileObject
fd = open("foo", "w")
fd.write('bar')       # works as expected
fo = FileObject(fd, mode="wU")  # shouldn't work anyway, but does
fo.write('foo')       # TypeError: write() argument 1 must be unicode, not str
  1. On Python 3, using 'U' shouldn't have any effect, but it does because we turn on _translate in this case, so TextIOWrapper is used, forcing read/write types to be str instead of bytes:
from gevent.fileobject import FileObject
fd = open("foo", "rbU")
fo = FileObject(fd, mode="rbU")
type(fo.read())    # expected "bytes", got "str"
  1. On Python 2, mode 't' is a little odd. Using mode 'rU' would read 'str' (good); using mode 'rt' would also read str (good) but using mode 'rUt' would read 'unicode' (not supposed to).
wiggin15 added a commit to Infinidat/gevent that referenced this issue Jul 31, 2019
@jamadden jamadden added this to the 1.5.0 milestone Sep 13, 2019
@jamadden jamadden closed this in 5bf0ca8 Dec 2, 2019
@wiggin15

This comment has been minimized.

Copy link
Contributor Author

@wiggin15 wiggin15 commented Dec 2, 2019

@jamadden thanks for having a look at this and working out a fix!

Unfortunately there are still a few problems even with the new code. Some of the problems I mentioned earlier in this issue still reproduce, specifically:

  1. On Python 3: reading with mode "rbU" reads "str" instead of "bytes"
  2. On Python 2: writing with mode "wU" expects writing "unicode" instead of "str"
  3. mode "wU" is actually not applicable and should be rejected

Should I open a new issue for these problems?

In addition, the new behavior mentioned: "If 't' is given, they will always work with unicode strings" sort of breaks compatibility with regular file objects on Python 2. Opening a file in Python 2 with "t" reads "str" type, but now gevent FileObject will read "unicode" type, which, like the original problem I encountered, can cause trouble when passing a FileObject opened with this mode to a function that expects to read "str" in this case.

@jamadden

This comment has been minimized.

Copy link
Member

@jamadden jamadden commented Dec 3, 2019

On Python 3: reading with mode "rbU" reads "str" instead of "bytes"

That appears to be a legit issue. It's obviously untested. I'll look into it. Thanks!

On Python 2: writing with mode "wU" expects writing "unicode" instead of "str"

Moot because...

mode "wU" is actually not applicable and should be rejected

And it is, raising a ValueError. If there are corner cases where it doesn't raise that...well, it's an illegal mode, behaviour is undefined. Don't do that.

In addition, the new behavior mentioned: "If 't' is given, they will always work with unicode strings" sort of breaks compatibility with regular file objects on Python 2

It depends on how the "regular file object" was opened. io.open interprets t this way on all platforms and Python versions, and that's what FileObject uses. Previously the behaviour was inconsistent depending on platform. Since text mode is the default (which was also inconsistent based on platform), if you want "native strings" and not text, don't specify 't'; in this way a user can explicitly access text, bytes, and native strings, in a manner that's version and platform independent.

I think there are no places that FileObject is silently monkey-patched in a way that accepts an arbitrary mode string; users must explicitly choose to use it, and the behaviour was so wildly inconsistent before that explicitly specifying 'rb' (or 'wb') everywhere was about the only thing that worked. So I think any compatibility impact of the newly expanded options and consistency will be minimal.

@wiggin15

This comment has been minimized.

Copy link
Contributor Author

@wiggin15 wiggin15 commented Dec 3, 2019

And it is, raising a ValueError. If there are corner cases where it doesn't raise that...well, it's an illegal mode, behaviour is undefined. Don't do that.

open raises ValueError with this mode, but FileObject doesn't. Technically users can do:

from gevent.fileobject import FileObject
fd = open("foo", "w")
fo = FileObject(fd, mode="wU")
fo.write("I cause TypeError on Python 2")
jamadden added a commit that referenced this issue Dec 3, 2019
… just on Py2.

Keeping control this way simplifies and unifies the implementations.

- Fix a concurrency bug in FileObjectPosix that we were hiding with buffering and a RuntimeError.
- Add all the error checking the stdlib does, with a few exceptions that make sense.

See #1441.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.