AVRO-1880: Futurize Py2 via BytesIO #720
Merged
Update
The name collision between `io` and `avro.io` is resolved in this codebase. It could still be an issue if a user invokes a Python module directly on the command line, as in `python lang/py/build/src/avro/tool.py …`. This approach causes Python to prepend that script's directory to the module search path (`sys.path`), which enables the collision. Using `python -m avro.tool …` is a safe alternative. In another PR we may want to consider installing `tool.py` via setuptools' `console_scripts` entry point. However, if we do that, we should probably pick a better name than just "tool".
Update
The problem in `avro.tether.tether_task` was resolved by replacing the buffer object each time. This is fine; however, the next problem is the name collision between `io` and `avro.io`.
Update
I think I found the root cause of the discrepancy. I still don't understand why it behaves this way, and I have asked about it on StackOverflow.
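
The fix described in the update above, replacing the buffer object each time, can be sketched as follows. This is a minimal illustration and not the code from `avro.tether.tether_task`: the helper names are hypothetical, and raw byte strings stand in for encoded datums. It also shows one documented `io.BytesIO` behavior worth knowing when a buffer is reused instead: `truncate()` does not move the stream position. The sketch does not claim that this is the root cause mentioned here.

```python
import io


def encode_with_fresh_buffers(payloads):
    """Hypothetical helper: give every payload its own BytesIO."""
    chunks = []
    for payload in payloads:
        buf = io.BytesIO()  # a new buffer, so no stale bytes or stale position
        buf.write(payload)
        chunks.append(buf.getvalue())
    return chunks


def encode_with_reused_buffer(payloads):
    """Hypothetical helper: reuse one BytesIO, rewinding it explicitly."""
    buf = io.BytesIO()
    chunks = []
    for payload in payloads:
        buf.seek(0)     # move the position back to the start first
        buf.truncate()  # then cut the buffer at the current position (0)
        buf.write(payload)
        chunks.append(buf.getvalue())
    return chunks


def encode_with_truncate_only(payloads):
    """Pitfall: truncate(0) empties the buffer but leaves the position alone."""
    buf = io.BytesIO()
    chunks = []
    for payload in payloads:
        buf.truncate(0)     # size is now 0, position is unchanged
        buf.write(payload)  # writing past the end pads the gap with zero bytes
        chunks.append(buf.getvalue())
    return chunks
```

With the payloads `[b"abc", b"de"]`, the first two helpers return the payloads unchanged, while the third returns a second chunk with leading zero bytes. Creating a fresh buffer avoids having to reason about the position at all, which is one argument for the approach taken in this PR.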
Context
One of the biggest hurdles to making the lang/py implementation work seamlessly in Python 3 is the difference in how Python 2 and Python 3 handle strings and bytes in streams. In Python 2 there were effectively two implementations of these types of streams: `StringIO.StringIO` and `cStringIO.StringIO`. They both handle byte streams, since Python 2 strings are just that.

There is no such thing as a `UnicodeIO` in Python 2, but `StringIO.StringIO` also handles unicode streams. It's confusing and dangerous how it handles both, and I don't understand it.

In Python 3, we use `io.BytesIO` for byte streams, and `io.StringIO` for unicode.

In almost every place avro uses it, it wants a byte stream. So we can replace `StringIO` with `io.BytesIO` and it just works.

Except in `avro.tether.tether_task`. If I replace the `StringIO` in there with `io.BytesIO`, then `test_tether_word_count` will produce a very truncated result instead of the mapping it is supposed to. But if I use `io.StringIO` then the test will fail because the `DatumWriter` actually does want to write bytes to this stream.

I would appreciate any help someone can lend on how to remove this lingering `StringIO.StringIO`, which will not work in Python 3.
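
To make the bytes-versus-text split concrete, here is a small sketch of how the two `io` stream classes react to each kind of payload on Python 3. The `accepts` helper is hypothetical and exists only for this illustration; it is not part of avro.

```python
import io


def accepts(stream, payload):
    """Hypothetical helper: report whether stream.write(payload) succeeds."""
    try:
        stream.write(payload)
    except TypeError:
        # io streams reject the wrong payload type instead of coercing it
        return False
    return True


# io.BytesIO takes bytes only; io.StringIO takes text only.
bytes_into_bytesio = accepts(io.BytesIO(), b"\x02\x06foo")
text_into_bytesio = accepts(io.BytesIO(), u"foo")
bytes_into_stringio = accepts(io.StringIO(), b"\x02\x06foo")
text_into_stringio = accepts(io.StringIO(), u"foo")
```

This strictness is consistent with the `io.StringIO` failure described above: a writer that emits bytes needs an `io.BytesIO` target.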