-
-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python 2.7 support #66
Conversation
dscottcs
commented
Jan 4, 2017
- write custom 'compress' and 'decompress' functions for gzip
- use bytearray instead of memoryview for byte arrays
- default mkdir to 'mkdir -p' subprocess call
- write custom 'compress' and 'decompress' functions for gzip - use bytearray instead of memoryview for byte arrays - default mkdir to 'mkdir -p' subprocess call
Thanks for the efforts here @dscottcs . Two comments:
PY2 = sys.version_info[0] == 2
...
if PY2:
...
else:
... |
fastparquet/util.py
Outdated
except TypeError as e: | ||
#Python2.7 equivalent | ||
import subprocess | ||
subprocess.call(['mkdir', '-p', f]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to assume posix. Could try/except the possible OSError without exist_ok.
Also added Python2 support to .travis.yml
Also disable Python2 speedup tests
Note that for Python2.7 Cython speedups are not supported for UTF8 encoded data, nor is BSON supported at all. In the case of BSON it may just be that we need a different kind of test for it - the existing test I couldn't figure out how to pass. If we end up merging this work we should make it clear in the docs that Python3 is preferred and Python2 support is provisional. |
fastparquet/compression.py
Outdated
from io import BytesIO | ||
bio = BytesIO() | ||
f = gzip.GzipFile(mode='wb', | ||
compresslevel=compresslevel, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest that the compression level could be a module constant, and also applied to gzip.compress for py3.
BSON does work on py2, but requires the installation of the bson package (in auto conda channel for linux, in mdtakashima for osx, haven't tried win). Your code seems to run fine with speedups for me, at least for reading, so long as I have unicode_literals in the columns benchmark (otherwise the outputs are not differentiated from bytes). |
fastparquet/util.py
Outdated
""" | ||
Return True if Python version is 2.x | ||
""" | ||
return (sys.version_info[0] == 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend instead
PY2 = sys.version_info[0] == 2
PY3 = sys.version_info[0] == 3
and then `if PY2`
I've seen this idiom in a few other projects.
fastparquet/util.py
Outdated
return bytearray(raw_bytes) if is_v2() else memoryview(raw_bytes) | ||
|
||
def str_type(): | ||
return basestring if is_v2() else str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps just define this as a literal rather than a function?
Simplifies paths, reduces memory footprint when there are multiple pages per row-group
Enter fix for dask@dfd025f#commitcomment-20375584
Note |
Alloz zero-row dataframes to be written
Easier assign
ping @dscottcs : do you need help with any of the comments above? |
No just busy. I'll try to get to it in a day or so.
…Sent from my iPhone
On Jan 19, 2017, at 1:45 PM, Martin Durant <notifications@github.com<mailto:notifications@github.com>> wrote:
ping @dscottcs<https://github.com/dscottcs> : do you need help with any of the comments above?
-
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#66 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AE-cpYK7NzHZtncm7GYECbSSfMdRfBosks5rT9ltgaJpZM4LbFfO>.
|
Previously, reading of definition levels stopped when there were enough to satisfy the given row count for the data page in question. It turns out there can be unused bytes after this - but that the total length of the definitions block is correctly specified, so the extra bytes can simply be skipped.
Appropriate for definitions reading with extra junk
If definitions has extra unused bytes, seek to right location
Thanks for the tip. Will test now. |
Not sure how to fix the dependency issue with python-snappy on Py3.6. I'm able to test locally without issues. |
Raised an issue. I'm not sure who sees this We can give them a little time to respond, and continue with py35/27 only if we hear nothing. |
OK, I built it myself, that can do for the time being: |
(Probably) temporary stopgap to get Python3.6 tests to work.
fastparquet/converted_types.py
Outdated
return np.array([int.from_bytes(d, byteorder='big', signed=True) * | ||
scale_factor for d in data]) | ||
|
||
return np.array([int(str(d).encode('hex'), 16) * scale_factor for d in data]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line never gets called.
def gzip_decompress_v2(data): | ||
import zlib | ||
return zlib.decompress(data, | ||
16+15) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps you could comment on this magic number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it's a bit arcane. It has to do with zlib 'windowBits' parameter. From the zlib manual:
The windowBits parameter is the base two logarithm of the window size (the size of the history buffer). It should be in the range 8..15 for this version of the library. Larger values of this parameter result in better compression at the expense of memory usage. The default value is 15 if deflateInit is used instead.
The '16' is a base value that indicates that gzip decompression must take place. The '15' is a maximum window size, as indicated above.
I can add a brief comment to the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please add a comment!
fastparquet/converted_types.py
Outdated
scale_factor for d in data]) | ||
if PY2: | ||
def from_bytes(d): | ||
return int(codecs.encode(d, 'hex'), 16) if len(d) else 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems binascii.b2a_hex(d)
is faster than codecs.encode(d, 'hex')
. I'd also add it's more well-known :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference:
>>> d = bytes(bytearray(range(24)))
>>> %timeit int(codecs.encode(d, 'hex'), 16) if len(d) else 0
1000000 loops, best of 3: 1.49 µs per loop
>>> %timeit int(binascii.b2a_hex(d), 16) if len(d) else 0
1000000 loops, best of 3: 729 ns per loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent. I will use the faster method.
@@ -126,7 +126,7 @@ def read_data_page(f, helper, header, metadata, skip_nulls=False, | |||
num = (encoding.read_unsigned_var_int(io_obj) >> 1) * 8 | |||
values = io_obj.read(num * bit_width // 8).view('int%i' % bit_width) | |||
elif bit_width: | |||
values = encoding.Numpy32(np.zeros(daph.num_values, | |||
values = encoding.Numpy32(np.empty(daph.num_values-num_nulls+7, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit surprised, does your PR fix bugs in addition to making the code base 2.7-compatible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how this ended up in my PR. This was a change by Martin on 1/27/2017 (according to git blame). May be a merge error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I recognize that, the green line is the correct one.
fastparquet/test/test_output.py
Outdated
@@ -1,3 +1,5 @@ | |||
# -*- coding: utf-8 -*- | |||
from __future__ import unicode_literals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really like this. Python 2-compatible APis should accept Python 2 str
in most places (for example column names).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. The problem was unicode literals in some of the assertions. Setting these explicitly to unicode seems to solve the problem.
fastparquet/util.py
Outdated
try: | ||
b = bytes(s) | ||
except UnicodeEncodeError as e: | ||
u = u''.join((s)).encode('utf-8').strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Er, what is the reasoning behind this line?
Let me suggest a different implementation:
if PY2:
def ensure_bytes(s):
return s.encode('utf-8') if isinstance(s, unicode) else s
else:
def ensure_bytes(s):
return s.encode('utf-8') if isinstance(s, str) else s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
fastparquet/util.py
Outdated
return bytearray(raw_bytes) if PY2 else memoryview(raw_bytes) | ||
|
||
|
||
def str_type(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code here is ok, but wouldn't it be simpler to vendor six
(https://pythonhosted.org/six/) instead of reimplementing this kind of thing?
(also, this needn't be a function call)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
fastparquet/util.py
Outdated
|
||
|
||
def byte_buffer(raw_bytes): | ||
return bytearray(raw_bytes) if PY2 else memoryview(raw_bytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The motivation for this is a bit unclear, or at least the semantics are a bit vague. memoryview
produces a (possible writable) view, while bytearray
produces a mutable copy.
If this is about np.frombuffer
, then it's true that it doesn't accept memoryview on Python 2, but OTOH it seems to accept the old-style buffer
object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
def check_column_names(columns, *args): | ||
"""Ensure that parameters listing column names have corresponding columns""" | ||
for arg in args: | ||
if isinstance(arg, (tuple, list)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if it's not a tuple or list? Just ignored? And why are all args
tested one by one, instead of a single of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another of Martin's commits - from 1/30/2017. Again, not sure how it ended up in my PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Things that are not tuple or list do automatically pass, by design.
A few possible arguments can be bool or string as well as a list of columns.
(yes, this is my code)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's not part of the PR then my comments are probably OT. Though I must add that what this function is supposed to do, and why it is coded as it is, is still a mystery to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After this PR is merged, I intend to package for release, so I can do a spot of doc-string updating at that point.
@@ -137,7 +140,10 @@ def convert(data, se): | |||
out = data.values | |||
elif dtype == "O": | |||
if converted_type == parquet_thrift.ConvertedType.UTF8: | |||
out = array_encode_utf8(data) | |||
if PY2: | |||
out = np.array([x.encode('utf8') for x in data], dtype="O") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally array_encode_utf8
should be fixed instead, though I don't have a problem personally if Python 2 takes a backseat in terms of performance :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No such luck. array_encode_utf8 is not friendly to Python2 at all. The performance issue is a known limitation, at least for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not appear to be true:
fastparquet.speedups.array_encode_utf8(np.array([u'ef¬∆∫˚'], dtype="O"))
array(['ef\xc2\xac\xe2\x88\x86\xe2\x88\xab\xcb\x9a'], dtype=object)
I suspect, again, that the difference is explicitly labeling the input as unicode. Without that, it would be the equivalent of py3's bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some small comments here, but generally a +1 to @pitrou 's comments.
fastparquet/dataframe.py
Outdated
@@ -37,7 +38,7 @@ def empty(types, size, cats=None, cols=None, index_type=None, index_name=None): | |||
views = {} | |||
|
|||
cols = cols if cols is not None else range(cols) | |||
if isinstance(types, str): | |||
if isinstance(types, str_type()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice not to have to call anything here. I recommend doing the check at import time
if PY2:
str_type = ...
else:
str_type = ...
from .util import str_type
Also we might want to create a separate compatibility.py files rather than util. Or, as @pitrou suggests, simply use six
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed by falling back to the six module.
fastparquet/util.py
Outdated
if not os.path.exists(f): | ||
os.makedirs(f) | ||
else: | ||
os.makedirs(f, exist_ok=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, would prefer checks like this to happen at import time
if PY2:
def makedirs(...):
...
else:
def makedirs(...):
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK fair enough.
@dscottcs , python-snappy is now available in conda-forge conda-forge/python-snappy-feedstock#5 , so the last change to .travis.yml can be reverted. |
@dscottcs, could you try merging from master, so that unrelated changes disappear from the PR? Thank you! |
Is |