Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python 2.7 support #66

Merged
merged 48 commits into from
Feb 16, 2017
Merged

Conversation

dscottcs
Copy link

@dscottcs dscottcs commented Jan 4, 2017

  • write custom 'compress' and 'decompress' functions for gzip
  • use bytearray instead of memoryview for byte arrays
  • default mkdir to 'mkdir -p' subprocess call

- write custom 'compress' and 'decompress' functions for gzip
- use bytearray instead of memoryview for byte arrays
- default mkdir to 'mkdir -p' subprocess call
@mrocklin
Copy link
Member

mrocklin commented Jan 4, 2017

Thanks for the efforts here @dscottcs . Two comments:

  1. It would be good to add a python: 2.7 entry to the travis.ci test matrix in the .travis.yml file (just add an extra line below 3.5
  2. It might be good to replace the try-except pattern here with if-else
PY2 = sys.version_info[0] == 2

...

if PY2:
    ...
else:
    ...

@mrocklin mrocklin mentioned this pull request Jan 5, 2017
except TypeError as e:
#Python2.7 equivalent
import subprocess
subprocess.call(['mkdir', '-p', f])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to assume posix. Could try/except the possible OSError without exist_ok.

@dscottcs
Copy link
Author

dscottcs commented Jan 9, 2017

Note that for Python2.7 Cython speedups are not supported for UTF8 encoded data, nor is BSON supported at all. In the case of BSON it may just be that we need a different kind of test for it - the existing test I couldn't figure out how to pass. If we end up merging this work we should make it clear in the docs that Python3 is preferred and Python2 support is provisional.

from io import BytesIO
bio = BytesIO()
f = gzip.GzipFile(mode='wb',
compresslevel=compresslevel,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest that the compression level could be a module constant, and also applied to gzip.compress for py3.

@martindurant
Copy link
Member

BSON does work on py2, but requires the installation of the bson package (in auto conda channel for linux, in mdtakashima for osx, haven't tried win).

Your code seems to run fine with speedups for me, at least for reading, so long as I have unicode_literals in the columns benchmark (otherwise the outputs are not differentiated from bytes).

"""
Return True if Python version is 2.x
"""
return (sys.version_info[0] == 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend instead

PY2 = sys.version_info[0] == 2
PY3 = sys.version_info[0] == 3

and then `if PY2`

I've seen this idiom in a few other projects.

return bytearray(raw_bytes) if is_v2() else memoryview(raw_bytes)

def str_type():
return basestring if is_v2() else str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps just define this as a literal rather than a function?

Martin Durant added 3 commits January 11, 2017 09:59
Simplifies paths, reduces memory footprint when there are multiple
pages per row-group
@martindurant
Copy link
Member

Note int.from_bytes in converted_types.convert - this does not exist in py2.

@martindurant martindurant mentioned this pull request Jan 16, 2017
@martindurant
Copy link
Member

ping @dscottcs : do you need help with any of the comments above?

@dscottcs
Copy link
Author

dscottcs commented Jan 19, 2017 via email

Martin Durant and others added 4 commits January 28, 2017 18:03
Previously, reading of definition levels stopped when there were enough to
satisfy the given row count for the data page in question. It turns out
there can be unused bytes after this - but that the total length of the
definitions block is correctly specified, so the extra bytes can simply
be skipped.
Appropriate for definitions reading with extra junk
If definitions has extra unused bytes, seek to right location
@dscottcs
Copy link
Author

dscottcs commented Feb 3, 2017

Thanks for the tip. Will test now.

@dscottcs
Copy link
Author

dscottcs commented Feb 3, 2017

Not sure how to fix the dependency issue with python-snappy on Py3.6. I'm able to test locally without issues.

@martindurant
Copy link
Member

martindurant commented Feb 3, 2017

Raised an issue. I'm not sure who sees this
conda-forge/python-snappy-feedstock#4

We can give them a little time to respond, and continue with py35/27 only if we hear nothing.

@martindurant
Copy link
Member

OK, I built it myself, that can do for the time being:
- conda install -c mdurant python-snappy

(Probably) temporary stopgap to get Python3.6 tests to work.
return np.array([int.from_bytes(d, byteorder='big', signed=True) *
scale_factor for d in data])

return np.array([int(str(d).encode('hex'), 16) * scale_factor for d in data])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line never gets called.

def gzip_decompress_v2(data):
import zlib
return zlib.decompress(data,
16+15)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you could comment on this magic number?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's a bit arcane. It has to do with zlib 'windowBits' parameter. From the zlib manual:

The windowBits parameter is the base two logarithm of the window size (the size of the history buffer). It should be in the range 8..15 for this version of the library. Larger values of this parameter result in better compression at the expense of memory usage. The default value is 15 if deflateInit is used instead.

The '16' is a base value that indicates that gzip decompression must take place. The '15' is a maximum window size, as indicated above.

I can add a brief comment to the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please add a comment!

scale_factor for d in data])
if PY2:
def from_bytes(d):
return int(codecs.encode(d, 'hex'), 16) if len(d) else 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems binascii.b2a_hex(d) is faster than codecs.encode(d, 'hex'). I'd also add it's more well-known :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference:

>>> d = bytes(bytearray(range(24)))
>>> %timeit int(codecs.encode(d, 'hex'), 16) if len(d) else 0
1000000 loops, best of 3: 1.49 µs per loop
>>> %timeit int(binascii.b2a_hex(d), 16) if len(d) else 0
1000000 loops, best of 3: 729 ns per loop

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent. I will use the faster method.

@@ -126,7 +126,7 @@ def read_data_page(f, helper, header, metadata, skip_nulls=False,
num = (encoding.read_unsigned_var_int(io_obj) >> 1) * 8
values = io_obj.read(num * bit_width // 8).view('int%i' % bit_width)
elif bit_width:
values = encoding.Numpy32(np.zeros(daph.num_values,
values = encoding.Numpy32(np.empty(daph.num_values-num_nulls+7,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit surprised, does your PR fix bugs in addition to making the code base 2.7-compatible?

Copy link
Author

@dscottcs dscottcs Feb 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how this ended up in my PR. This was a change by Martin on 1/27/2017 (according to git blame). May be a merge error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I recognize that, the green line is the correct one.

@@ -1,3 +1,5 @@
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like this. Python 2-compatible APis should accept Python 2 str in most places (for example column names).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. The problem was unicode literals in some of the assertions. Setting these explicitly to unicode seems to solve the problem.

try:
b = bytes(s)
except UnicodeEncodeError as e:
u = u''.join((s)).encode('utf-8').strip()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Er, what is the reasoning behind this line?
Let me suggest a different implementation:

if PY2:
    def ensure_bytes(s):
        return s.encode('utf-8') if isinstance(s, unicode) else s
else:
    def ensure_bytes(s):
        return s.encode('utf-8') if isinstance(s, str) else s

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

return bytearray(raw_bytes) if PY2 else memoryview(raw_bytes)


def str_type():
Copy link
Member

@pitrou pitrou Feb 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code here is ok, but wouldn't it be simpler to vendor six (https://pythonhosted.org/six/) instead of reimplementing this kind of thing?
(also, this needn't be a function call)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK



def byte_buffer(raw_bytes):
return bytearray(raw_bytes) if PY2 else memoryview(raw_bytes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivation for this is a bit unclear, or at least the semantics are a bit vague. memoryview produces a (possible writable) view, while bytearray produces a mutable copy.

If this is about np.frombuffer, then it's true that it doesn't accept memoryview on Python 2, but OTOH it seems to accept the old-style buffer object.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

def check_column_names(columns, *args):
"""Ensure that parameters listing column names have corresponding columns"""
for arg in args:
if isinstance(arg, (tuple, list)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if it's not a tuple or list? Just ignored? And why are all args tested one by one, instead of a single of them?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another of Martin's commits - from 1/30/2017. Again, not sure how it ended up in my PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things that are not tuple or list do automatically pass, by design.
A few possible arguments can be bool or string as well as a list of columns.
(yes, this is my code)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's not part of the PR then my comments are probably OT. Though I must add that what this function is supposed to do, and why it is coded as it is, is still a mystery to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this PR is merged, I intend to package for release, so I can do a spot of doc-string updating at that point.

@@ -137,7 +140,10 @@ def convert(data, se):
out = data.values
elif dtype == "O":
if converted_type == parquet_thrift.ConvertedType.UTF8:
out = array_encode_utf8(data)
if PY2:
out = np.array([x.encode('utf8') for x in data], dtype="O")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally array_encode_utf8 should be fixed instead, though I don't have a problem personally if Python 2 takes a backseat in terms of performance :-)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No such luck. array_encode_utf8 is not friendly to Python2 at all. The performance issue is a known limitation, at least for now.

Copy link
Member

@martindurant martindurant Feb 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not appear to be true:

fastparquet.speedups.array_encode_utf8(np.array([u'ef¬∆∫˚'], dtype="O"))
array(['ef\xc2\xac\xe2\x88\x86\xe2\x88\xab\xcb\x9a'], dtype=object)

I suspect, again, that the difference is explicitly labeling the input as unicode. Without that, it would be the equivalent of py3's bytes.

Copy link
Member

@mrocklin mrocklin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some small comments here, but generally a +1 to @pitrou 's comments.

@@ -37,7 +38,7 @@ def empty(types, size, cats=None, cols=None, index_type=None, index_name=None):
views = {}

cols = cols if cols is not None else range(cols)
if isinstance(types, str):
if isinstance(types, str_type()):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice not to have to call anything here. I recommend doing the check at import time

if PY2:
    str_type = ...
else:
    str_type = ...

from .util import str_type

Also we might want to create a separate compatibility.py files rather than util. Or, as @pitrou suggests, simply use six.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed by falling back to the six module.

if not os.path.exists(f):
os.makedirs(f)
else:
os.makedirs(f, exist_ok=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, would prefer checks like this to happen at import time

if PY2:
    def makedirs(...):
        ...
else:
    def makedirs(...):
        ...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK fair enough.

@martindurant
Copy link
Member

@dscottcs , python-snappy is now available in conda-forge conda-forge/python-snappy-feedstock#5 , so the last change to .travis.yml can be reverted.

@pitrou
Copy link
Member

pitrou commented Feb 7, 2017

@dscottcs, could you try merging from master, so that unrelated changes disappear from the PR? Thank you!

@martindurant
Copy link
Member

Is six now a dependency?

@martindurant martindurant merged commit 4cfa827 into dask:master Feb 16, 2017
@dscottcs dscottcs deleted the fix/python2.7_support branch March 28, 2017 20:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants