[BEAM-1251] Upgrade from buffer to memoryview for Python 3 #4820

Closed
wants to merge 1 commit into from

Conversation

cclauss
@cclauss cclauss commented Mar 7, 2018

buffer was removed in Python 3 in favor of memoryview.

@cclauss
Author

cclauss commented Mar 13, 2018

@holdenk Your review please?

@holdenk
Contributor

holdenk commented Mar 14, 2018

LGTM but from my memory I think I saw a similar PR, was that also yours? (Or am I just imagining things).

@cclauss
Author

cclauss commented Mar 17, 2018

This is the only PR that I have that touches this code. I mentioned this issue in #4798 but I did not propose a fix in that PR.

@aaltay Your review please?

@@ -309,8 +309,8 @@ def _decompress_bytes(data, codec):

# Compressed data includes a 4-byte CRC32 checksum which we verify.
# We take care to avoid extra copies of data while slicing large objects
# by use of a buffer.
result = snappy.decompress(buffer(data)[:-4])
Member

Have you tested this change? When I ran it, it failed with: TypeError: argument 1 must be string or read-only buffer, not memoryview.

This is because slicing a buffer returns the raw data, whereas slicing a memoryview returns another memoryview object for that subsection.
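
The memoryview half of that difference can be checked with a minimal Python 3 sketch. The payload below is made up for illustration, with the last four bytes standing in for the CRC32; it does not call snappy at all:

```python
# Illustrative payload: the last 4 bytes stand in for the CRC32 checksum.
data = bytearray(b"payload-bytes" + b"\x00\x01\x02\x03")

view = memoryview(data)[:-4]

# Slicing a memoryview yields another memoryview, not bytes or str.
assert isinstance(view, memoryview)

# The slice shares memory with `data`, so a change to the source
# is visible through the view (no copy was made).
data[0] = ord("P")
first_byte = view[0]
```

Any function that strictly requires a string or read-only buffer argument would reject `view` here, which is consistent with the TypeError above.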

@cclauss
Author

cclauss commented Mar 18, 2018

Thanks for catching this. I did not have an effective way to test. Reading through:

memoryview exists in all versions of Python that Beam supports so once we find a memoryview-based solution that works, we should be able to drop buffer altogether.

@cclauss
Author

cclauss commented Mar 19, 2018

@aaltay Can you please retry with this update?

@aaltay
Member

aaltay commented Mar 19, 2018

No, the changed version does not work either. The expression six.binary_type(memoryview(data)[:-4]) produces a literal string of the form <memory at 0x7f62ee334510>, and decompression then fails with snappy.UncompressError: Error while decompressing: invalid input

Besides, binary_type is just str, so even if it had worked as expected in this case, it would have created a copy of the data, which defeats the purpose.
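
The copy concern also applies on Python 3, where six.binary_type is bytes. A minimal sketch, using bytes directly and made-up data, suggests that converting a memoryview slice materializes an independent copy:

```python
# Illustrative data only; the last 4 bytes play the role of the checksum.
data = bytearray(b"abcdefgh")

# On Python 3, bytes(memoryview) returns the contents, not a repr string,
# but it does so by copying the underlying bytes.
trimmed = bytes(memoryview(data)[:-4])

# Mutating the source afterwards does not affect `trimmed`,
# which shows that `trimmed` owns its own copy.
data[0] = ord("X")
```

So the conversion would pass the right bytes to the decompressor on Python 3, but without the zero-copy benefit the original comment aims for.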

The real solution here would be to upgrade snappy to accept memoryview as an argument. If we cannot do that, we can remove the optimization and settle for snappy.decompress(data[:-4]). Or perhaps better we can conditionally keep the buffer for python2 only.

CC'ing a few people who might have an idea of the impact of copying data here:
cc: @chamikaramj @katsiapis

@cclauss
Author

cclauss commented Mar 20, 2018

Are we using the current python-snappy 0.5.2? Perhaps @martindurant has some ideas for us.

@aaltay
Member

aaltay commented Mar 20, 2018

Yes, we depend on the python-snappy package from PyPI. Dataflow has 0.5.1 installed, not the latest 0.5.2, but I do not think there is a change related to this between the two. I tested this PR with the latest available version.

@cclauss
Author

cclauss commented Mar 21, 2018

@stale

stale bot commented Jun 7, 2018

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@stale stale bot added the wontfix label Jun 7, 2018
@stale

stale bot commented Jun 14, 2018

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@cclauss
Author

cclauss commented Jul 2, 2018

A fix has been checked into intake/python-snappy#72

@cclauss cclauss deleted the buffer-to-memoryview branch July 4, 2018 12:02