Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-1880: Futurize Py2 via BytesIO #720

Merged
merged 3 commits into from Nov 27, 2019

Conversation

kojiromike
Copy link
Contributor

@kojiromike kojiromike commented Nov 18, 2019

Update

The name collision between io and avro.io is resolved in this codebase. It could still be an issue if a user invokes a python module directly on the command line as in python lang/py/build/src/avro/tool.py …. This approach causes python to prepend that path to the pythonpath, which enables the collision. Using python -m avro.tool … is a safe alternative. In another pr we may want to consider installing tool.py via setuptools' console_script entrypoint. However, if we do that, we should probably pick a better name than just "tool".

Update

The problem in avro.tether.tether_task was resolved by replacing the buffer object each time. This is fine; however, the next problem is the name collision between io and avro.io.

Update

I think I found the root cause of the discrepancy. I still don't understand why it behaves this way, and I have asked about it on StackOverflow.

Context

One of the biggest hurdles to making the lang/py implementation work seamlessly in Python 3 is the differences between how Python 2 and Python 3 handle strings and bytes in streams. In python 2 there were effectively two implementations of these types of streams: StringIO.StringIO and cStringIO.StringIO. They both handle byte streams, since Python 2 strings are just that.

There is no such thing as a UnicodeIO in Python 2, but StringIO.StringIO also handles unicode streams. It's confusing and dangerous how it handles both, and I don't understand it.

In Python 3, we use io.BytesIO for byte streams, and io.StringIO for unicode.

In almost every place avro uses it, it wants a byte stream. So we can replace StringIO with io.BytesIO and it just works.

Except in avro.tether.tether_task. If I replace the StringIO in there with io.BytesIO, then test_tether_word_count will produce a very truncated result instead of the mapping it is supposed to. But if I use io.StringIO then the test will fail because the DatumWriter actually does want to write bytes to this stream.

I would appreciate any help someone can lend to how to remove this lingering StringIO.StringIO, which will not work in Python 3.

Jira

  • Addresses AVRO-1880
  • References it in the PR title
  • Adds no dependencies

Tests

  • Updates existing tests.

Commits

Documentation

  • Does not require additional documentation yet. (This is incremental compatibility work.)

@kojiromike kojiromike self-assigned this Nov 18, 2019
@kojiromike kojiromike changed the title [Help Wanted] AVRO-1880: Futurize Py2 via BytesIO AVRO-1880: Futurize Py2 via BytesIO Nov 19, 2019
This works nearly seamlessly, except for a snag in avro.tether.tether_task. I don't understand it yet.
@kojiromike kojiromike merged commit 790faee into apache:master Nov 27, 2019
ecopoesis pushed a commit to ecopoesis/avro that referenced this pull request Jan 8, 2020
* Replace buffer instead of truncating because bytesio.truncate does not seek.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant