AVRO-1880: Futurize Py2 via BytesIO #720

kojiromike · 2019-11-18T04:37:30Z

Update

The name collision between io and avro.io is resolved in this codebase. It can still occur if a user invokes a Python module directly on the command line, as in python lang/py/build/src/avro/tool.py …. Running a script that way makes Python prepend the script's directory to sys.path, which enables the collision. Using python -m avro.tool … is a safe alternative. In another PR we may want to consider installing tool.py via setuptools' console_scripts entry point. However, if we do that, we should probably pick a better name than just "tool".
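To make the sys.path difference concrete, here is a minimal sketch. It builds a throwaway package (the names demo_pkg and tool.py are hypothetical stand-ins, not the real avro layout) and reports the first sys.path entry the script sees when run by file path versus with -m. It assumes a default interpreter configuration, with no safe-path option in effect.

```python
import os
import subprocess
import sys
import tempfile


def first_sys_path_entry(run_as_module):
    """Run a throwaway script and return (sys.path[0] it saw, package dir, working dir)."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = os.path.join(tmp, "demo_pkg")  # hypothetical package, stands in for avro
        os.mkdir(pkg)
        open(os.path.join(pkg, "__init__.py"), "w").close()
        script = os.path.join(pkg, "tool.py")
        with open(script, "w") as f:
            # realpath("") resolves to the working directory, so an empty
            # first entry is reported as the directory it actually means.
            f.write("import os, sys\nprint(os.path.realpath(sys.path[0]))\n")
        if run_as_module:
            cmd = [sys.executable, "-m", "demo_pkg.tool"]
        else:
            cmd = [sys.executable, script]
        seen = subprocess.check_output(cmd, cwd=tmp).decode().strip()
        return seen, os.path.realpath(pkg), os.path.realpath(tmp)
```

Run by path, the first entry is the package directory itself, so a sibling module named io.py becomes importable as a top-level io. Run with -m, the first entry is the working directory, and the sibling stays reachable only as demo_pkg.io.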

Update

The problem in avro.tether.tether_task was resolved by replacing the buffer object each time instead of truncating it, because BytesIO.truncate does not move the stream position. This is fine; however, the next problem is the name collision between io and avro.io.
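The truncate behavior is easy to reproduce with the standard library alone. The sketch below is not the tether_task code; the two helper functions are hypothetical and only contrast reusing one io.BytesIO via truncate(0) with creating a fresh buffer per record.

```python
import io


def reuse_by_truncating(records):
    """Reuse one buffer, clearing it with truncate(0) before each write."""
    buf = io.BytesIO()
    results = []
    for record in records:
        buf.truncate(0)  # shrinks the buffer but leaves the position where it was
        buf.write(record)  # writes at the old position; the gap is zero-filled
        results.append(buf.getvalue())
    return results


def reuse_by_replacing(records):
    """Create a fresh buffer for each record."""
    results = []
    for record in records:
        buf = io.BytesIO()
        buf.write(record)
        results.append(buf.getvalue())
    return results


print(reuse_by_truncating([b"abc", b"de"]))  # [b'abc', b'\x00\x00\x00de']
print(reuse_by_replacing([b"abc", b"de"]))   # [b'abc', b'de']
```

Calling buf.seek(0) before buf.truncate() would also give a clean buffer; replacing the object is simply the approach this PR took.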

Update

I think I found the root cause of the discrepancy. I still don't understand why it behaves this way, and I have asked about it on Stack Overflow.

Context

One of the biggest hurdles to making the lang/py implementation work seamlessly in Python 3 is the difference between how Python 2 and Python 3 handle strings and bytes in streams. In Python 2 there were effectively two implementations of these types of streams: StringIO.StringIO and cStringIO.StringIO. They both handle byte streams, since Python 2 strings are just that.

There is no such thing as a UnicodeIO in Python 2, but StringIO.StringIO also handles unicode streams. It's confusing and dangerous how it handles both, and I don't understand it.

In Python 3, we use io.BytesIO for byte streams, and io.StringIO for unicode.
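As a small illustration of that split under Python 3, each stream type rejects the other's payload with a TypeError. The helper name accepts is hypothetical and exists only for this sketch.

```python
import io


def accepts(stream, payload):
    """Return True if the stream accepts the payload, False if it raises TypeError."""
    try:
        stream.write(payload)
        return True
    except TypeError:
        return False


print(accepts(io.BytesIO(), b"raw bytes"))   # True
print(accepts(io.BytesIO(), "text"))         # False
print(accepts(io.StringIO(), "text"))        # True
print(accepts(io.StringIO(), b"raw bytes"))  # False
```

This is why a writer that emits bytes cannot be pointed at an io.StringIO, and why the choice of stream type has to match what the writer produces.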

In almost every place Avro uses a StringIO, it wants a byte stream. So we can replace StringIO with io.BytesIO and it just works.

Except in avro.tether.tether_task. If I replace the StringIO in there with io.BytesIO, then test_tether_word_count will produce a very truncated result instead of the mapping it is supposed to. But if I use io.StringIO then the test will fail because the DatumWriter actually does want to write bytes to this stream.

I would appreciate any help with removing this lingering StringIO.StringIO, which will not work in Python 3.

Jira

Addresses AVRO-1880
References it in the PR title
Adds no dependencies

Tests

Updates existing tests.

Commits

Reference Jira issues in their subject lines.
Follow the guidelines from "How to write a good git commit message"

Documentation

Does not require additional documentation yet. (This is incremental compatibility work.)

This works nearly seamlessly, except for a snag in avro.tether.tether_task. I don't understand it yet.

* Replace buffer instead of truncating because bytesio.truncate does not seek.

kojiromike self-assigned this Nov 18, 2019

probot-autolabeler bot added the Python label Nov 18, 2019

kojiromike force-pushed the AVRO-1880/io.BytesIO branch from 5715970 to 3a475c1 Compare November 18, 2019 04:39

kojiromike changed the title ~~[Help Wanted] AVRO-1880: Futurize Py2 via BytesIO~~ AVRO-1880: Futurize Py2 via BytesIO Nov 19, 2019

kojiromike force-pushed the AVRO-1880/io.BytesIO branch from 9b3f15d to ccc3ec4 Compare November 19, 2019 04:11

probot-autolabeler bot added the build label Nov 19, 2019

kojiromike added 3 commits November 20, 2019 21:40

AVRO-1880: Futurize Py2 via BytesIO

e745e6c

This works nearly seamlessly, except for a snag in avro.tether.tether_task. I don't understand it yet.

AVRO-1880: Replace Buffer to Obviate StringIO

98178e0

AVRO-1880: Clear Namespacing

d0af764

kojiromike force-pushed the AVRO-1880/io.BytesIO branch from 508bce5 to d0af764 Compare November 21, 2019 02:40

kojiromike merged commit 790faee into apache:master Nov 27, 2019

ecopoesis pushed a commit to ecopoesis/avro that referenced this pull request Jan 8, 2020

AVRO-1880: Futurize Py2 via BytesIO (apache#720)

66f2750

* Replace buffer instead of truncating because bytesio.truncate does not seek.
