Python 2.7 support #66

dscottcs · 2017-01-04T21:28:15Z

write custom 'compress' and 'decompress' functions for gzip
use bytearray instead of memoryview for byte arrays
default mkdir to 'mkdir -p' subprocess call


mrocklin · 2017-01-04T21:48:55Z

Thanks for the efforts here @dscottcs . Two comments:

It would be good to add a python: 2.7 entry to the Travis CI test matrix in the .travis.yml file (just add an extra line below 3.5).
It might be good to replace the try-except pattern here with if-else:

PY2 = sys.version_info[0] == 2

...

if PY2:
    ...
else:
    ...

martindurant · 2017-01-06T18:23:24Z

fastparquet/util.py

+    except TypeError as e:
+        #Python2.7 equivalent
+        import subprocess
+        subprocess.call(['mkdir', '-p', f])


We don't want to assume posix. Could try/except the possible OSError without exist_ok.
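A minimal sketch of the try/except approach suggested above, assuming the goal is to mimic `exist_ok=True` on Python 2 without shelling out. The helper name `makedirs_exist_ok` is hypothetical and is not taken from the PR.

```python
import errno
import os


def makedirs_exist_ok(path):
    """Create `path` and any missing parents; do nothing if it already exists.

    Hypothetical helper: works on both Python 2 and 3 and on non-POSIX systems.
    """
    try:
        os.makedirs(path)
    except OSError as e:
        # Swallow only the "already exists as a directory" case;
        # re-raise everything else (permissions, path is a file, ...).
        if e.errno != errno.EEXIST or not os.path.isdir(path):
            raise
```

Catching the error rather than checking `os.path.exists` first avoids a race when two processes create the same directory at once.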

Also added Python2 support to .travis.yml

Also disable Python2 speedup tests

dscottcs · 2017-01-09T23:52:53Z

Note that for Python2.7 Cython speedups are not supported for UTF8 encoded data, nor is BSON supported at all. In the case of BSON it may just be that we need a different kind of test for it - the existing test I couldn't figure out how to pass. If we end up merging this work we should make it clear in the docs that Python3 is preferred and Python2 support is provisional.

martindurant · 2017-01-10T14:08:37Z

fastparquet/compression.py

+        from io import BytesIO
+        bio = BytesIO()
+        f = gzip.GzipFile(mode='wb',
+                          compresslevel=compresslevel,


Suggest that the compression level could be a module constant, and also applied to gzip.compress for py3.
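One way this suggestion might look, as a sketch only. The constant name `COMPRESSION_LEVEL`, its value, and the function name `gzip_compress` are assumptions for illustration, not the names used in fastparquet.

```python
import gzip
import sys
from io import BytesIO

PY2 = sys.version_info[0] == 2

# Hypothetical module-level constant shared by both code paths.
COMPRESSION_LEVEL = 6

if PY2:
    def gzip_compress(data, compresslevel=COMPRESSION_LEVEL):
        # Python 2 has no gzip.compress, so write through an in-memory file.
        bio = BytesIO()
        f = gzip.GzipFile(mode='wb', compresslevel=compresslevel, fileobj=bio)
        f.write(data)
        f.close()
        return bio.getvalue()
else:
    def gzip_compress(data, compresslevel=COMPRESSION_LEVEL):
        # Python 3.2+ provides gzip.compress directly.
        return gzip.compress(data, compresslevel=compresslevel)
```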

martindurant · 2017-01-10T14:36:25Z

BSON does work on py2, but requires the installation of the bson package (in auto conda channel for linux, in mdtakashima for osx, haven't tried win).

Your code seems to run fine with speedups for me, at least for reading, so long as I have unicode_literals in the columns benchmark (otherwise the outputs are not differentiated from bytes).

mrocklin · 2017-01-10T14:38:31Z

fastparquet/util.py

+    """
+    Return True if Python version is 2.x
+    """
+    return (sys.version_info[0] == 2)


I recommend instead

PY2 = sys.version_info[0] == 2
PY3 = sys.version_info[0] == 3

and then `if PY2`

I've seen this idiom in a few other projects.

mrocklin · 2017-01-10T14:39:32Z

fastparquet/util.py

+    return bytearray(raw_bytes) if is_v2() else memoryview(raw_bytes)
+
+def str_type():
+    return basestring if is_v2() else str


Perhaps just define this as a literal rather than a function?
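A sketch of what "a literal rather than a function" could mean here: the name is bound once at import time, so call sites use `str_type` directly. The helper `is_string` is hypothetical and only shows a call site.

```python
import sys

PY2 = sys.version_info[0] == 2

# Bound once at import time; no function call is needed at the call sites.
if PY2:
    str_type = basestring  # noqa: F821 (only defined on Python 2)
else:
    str_type = str


def is_string(obj):
    """Hypothetical call site: the isinstance check takes the constant as-is."""
    return isinstance(obj, str_type)
```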

Simplifies paths, reduces memory footprint when there are multiple pages per row-group

Enter fix for dask@dfd025f#commitcomment-20375584

Fixes dask#68

martindurant · 2017-01-11T20:48:58Z

Note int.from_bytes in converted_types.convert - this does not exist in py2.

Alloz zero-row dataframes to be written

Easier assign

martindurant · 2017-01-19T21:45:17Z

ping @dscottcs : do you need help with any of the comments above?

dscottcs · 2017-01-19T21:47:19Z

No just busy. I'll try to get to it in a day or so.


Previously, reading of definition levels stopped when there were enough to satisfy the given row count for the data page in question. It turns out there can be unused bytes after this - but that the total length of the definitions block is correctly specified, so the extra bytes can simply be skipped.

Appropriate for definitions reading with extra junk

If definitions has extra unused bytes, seek to right location

dscottcs · 2017-02-03T20:58:44Z

Thanks for the tip. Will test now.

dscottcs · 2017-02-03T21:25:12Z

Not sure how to fix the dependency issue with python-snappy on Py3.6. I'm able to test locally without issues.

martindurant · 2017-02-03T21:49:51Z

Raised an issue. I'm not sure who sees this
conda-forge/python-snappy-feedstock#4

We can give them a little time to respond, and continue with py35/27 only if we hear nothing.

martindurant · 2017-02-03T22:18:25Z

OK, I built it myself, that can do for the time being:
- conda install -c mdurant python-snappy

(Probably) temporary stopgap to get Python3.6 tests to work.

martindurant · 2017-02-03T23:03:07Z

fastparquet/converted_types.py

+                return np.array([int.from_bytes(d, byteorder='big', signed=True) *
+                                 scale_factor for d in data])
+
+            return np.array([int(str(d).encode('hex'), 16) * scale_factor for d in data])


This line never gets called.

pitrou · 2017-02-04T18:10:39Z

fastparquet/compression.py

+    def gzip_decompress_v2(data):
+        import zlib
+        return zlib.decompress(data,
+                               16+15)


Perhaps you could comment on this magic number?

I agree it's a bit arcane. It has to do with zlib 'windowBits' parameter. From the zlib manual:

The windowBits parameter is the base two logarithm of the window size (the size of the history buffer). It should be in the range 8..15 for this version of the library. Larger values of this parameter result in better compression at the expense of memory usage. The default value is 15 if deflateInit is used instead.

The '16' is a base value that indicates that gzip decompression must take place. The '15' is a maximum window size, as indicated above.

I can add a brief comment to the code.

Yes, please add a comment!
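For illustration, the magic number could be named and commented along these lines. The constant name `GZIP_WBITS` and the function name are assumptions; `zlib.MAX_WBITS` is the standard-library constant for the maximum window size (15).

```python
import gzip
import zlib

# windowBits for zlib: 15 is the largest history window (2**15 bytes),
# and adding 16 tells zlib to expect a gzip header and trailer
# instead of a raw zlib stream.
GZIP_WBITS = 16 + zlib.MAX_WBITS


def gzip_decompress(data):
    """Decompress a gzip-framed byte string using zlib directly."""
    return zlib.decompress(data, GZIP_WBITS)
```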

pitrou · 2017-02-04T18:16:19Z

fastparquet/converted_types.py

-                             scale_factor for d in data])
+            if PY2:
+                def from_bytes(d):
+                    return int(codecs.encode(d, 'hex'), 16) if len(d) else 0


It seems binascii.b2a_hex(d) is faster than codecs.encode(d, 'hex'). I'd also add it's more well-known :-)

For reference:

>>> d = bytes(bytearray(range(24)))
>>> %timeit int(codecs.encode(d, 'hex'), 16) if len(d) else 0
1000000 loops, best of 3: 1.49 µs per loop
>>> %timeit int(binascii.b2a_hex(d), 16) if len(d) else 0
1000000 loops, best of 3: 729 ns per loop

Excellent. I will use the faster method.
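One caveat worth illustrating: converting the hex string with `int(..., 16)` reads the bytes as an unsigned big-endian integer, whereas the `int.from_bytes` call quoted earlier in this thread passes `signed=True`. The sketch below shows a possible two's-complement adjustment. The function name `from_bytes_signed` is hypothetical, and whether the PR needs this depends on the data being decoded.

```python
import binascii


def from_bytes_signed(d):
    """Big-endian two's-complement bytes -> int, without int.from_bytes."""
    if not len(d):
        return 0
    value = int(binascii.b2a_hex(d), 16)  # unsigned interpretation
    # If the top bit of the first byte is set, the number is negative:
    # subtract 2**(number of bits) to get the signed value.
    if bytearray(d)[0] & 0x80:
        value -= 1 << (8 * len(d))
    return value
```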

pitrou · 2017-02-04T18:18:08Z

fastparquet/core.py

@@ -126,7 +126,7 @@ def read_data_page(f, helper, header, metadata, skip_nulls=False,
            num = (encoding.read_unsigned_var_int(io_obj) >> 1) * 8
            values = io_obj.read(num * bit_width // 8).view('int%i' % bit_width)
        elif bit_width:
-            values = encoding.Numpy32(np.zeros(daph.num_values,
+            values = encoding.Numpy32(np.empty(daph.num_values-num_nulls+7,


I'm a bit surprised, does your PR fix bugs in addition to making the code base 2.7-compatible?

Not sure how this ended up in my PR. This was a change by Martin on 1/27/2017 (according to git blame). May be a merge error.

Yes, I recognize that, the green line is the correct one.

pitrou · 2017-02-04T18:20:32Z

fastparquet/test/test_output.py

@@ -1,3 +1,5 @@
+# -*- coding: utf-8 -*-
+from __future__ import unicode_literals


I don't really like this. Python 2-compatible APIs should accept Python 2 str in most places (for example column names).

Fair enough. The problem was unicode literals in some of the assertions. Setting these explicitly to unicode seems to solve the problem.

pitrou · 2017-02-04T18:25:02Z

fastparquet/util.py

+        try:
+            b = bytes(s)
+        except UnicodeEncodeError as e:
+            u = u''.join((s)).encode('utf-8').strip()


Er, what is the reasoning behind this line?
Let me suggest a different implementation:

if PY2:
    def ensure_bytes(s):
        return s.encode('utf-8') if isinstance(s, unicode) else s
else:
    def ensure_bytes(s):
        return s.encode('utf-8') if isinstance(s, str) else s

pitrou · 2017-02-04T18:25:38Z

fastparquet/util.py

+    return bytearray(raw_bytes) if PY2 else memoryview(raw_bytes)
+
+
+def str_type():


The code here is ok, but wouldn't it be simpler to vendor six (https://pythonhosted.org/six/) instead of reimplementing this kind of thing?
(also, this needn't be a function call)

pitrou · 2017-02-04T18:28:12Z

fastparquet/util.py

+
+
+def byte_buffer(raw_bytes):
+    return bytearray(raw_bytes) if PY2 else memoryview(raw_bytes)


The motivation for this is a bit unclear, or at least the semantics are a bit vague. memoryview produces a (possibly writable) view, while bytearray produces a mutable copy.

If this is about np.frombuffer, then it's true that it doesn't accept memoryview on Python 2, but OTOH it seems to accept the old-style buffer object.
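The view-versus-copy difference can be seen with the standard library alone. This small demonstration uses a hypothetical function name and is not code from the PR.

```python
def view_vs_copy():
    """Show that memoryview shares memory while bytearray(...) copies it."""
    source = bytearray(b"abcd")
    view = memoryview(source)   # no copy: reads through to `source`
    copy = bytearray(source)    # independent mutable copy
    source[0] = ord("z")        # mutate the original buffer
    return bytes(view), bytes(copy)
```

The view reflects the later write to `source`, while the copy does not, so the two branches of `byte_buffer` are not interchangeable if the caller ever mutates the underlying buffer.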

pitrou · 2017-02-04T18:29:15Z

fastparquet/util.py

+def check_column_names(columns, *args):
+    """Ensure that parameters listing column names have corresponding columns"""
+    for arg in args:
+        if isinstance(arg, (tuple, list)):


What if it's not a tuple or list? Is it just ignored? And why are all args tested one by one, instead of just a single one?

Another of Martin's commits - from 1/30/2017. Again, not sure how it ended up in my PR.

Things that are not tuple or list do automatically pass, by design.
A few possible arguments can be bool or string as well as a list of columns.
(yes, this is my code)

If that's not part of the PR then my comments are probably OT. Though I must add that what this function is supposed to do, and why it is coded as it is, is still a mystery to me.

After this PR is merged, I intend to package for release, so I can do a spot of doc-string updating at that point.

pitrou · 2017-02-04T18:30:13Z

fastparquet/writer.py

@@ -137,7 +140,10 @@ def convert(data, se):
        out = data.values
    elif dtype == "O":
        if converted_type == parquet_thrift.ConvertedType.UTF8:
-            out = array_encode_utf8(data)
+            if PY2:
+                out = np.array([x.encode('utf8') for x in data], dtype="O")


Ideally array_encode_utf8 should be fixed instead, though I don't have a problem personally if Python 2 takes a backseat in terms of performance :-)

No such luck. array_encode_utf8 is not friendly to Python2 at all. The performance issue is a known limitation, at least for now.

This does not appear to be true:

fastparquet.speedups.array_encode_utf8(np.array([u'ef¬∆∫˚'], dtype="O"))
array(['ef\xc2\xac\xe2\x88\x86\xe2\x88\xab\xcb\x9a'], dtype=object)

I suspect, again, that the difference is explicitly labeling the input as unicode. Without that, it would be the equivalent of py3's bytes.

mrocklin

I added some small comments here, but generally a +1 to @pitrou 's comments.

mrocklin · 2017-02-04T21:12:09Z

fastparquet/dataframe.py

@@ -37,7 +38,7 @@ def empty(types, size, cats=None, cols=None, index_type=None, index_name=None):
    views = {}

    cols = cols if cols is not None else range(cols)
-    if isinstance(types, str):
+    if isinstance(types, str_type()):


It would be nice not to have to call anything here. I recommend doing the check at import time

if PY2:
    str_type = ...
else:
    str_type = ...

from .util import str_type

Also we might want to create a separate compatibility.py file rather than util. Or, as @pitrou suggests, simply use six.

Fixed by falling back to the six module.

mrocklin · 2017-02-04T21:15:01Z

fastparquet/util.py

+        if not os.path.exists(f):
+            os.makedirs(f)
+    else:
+        os.makedirs(f, exist_ok=True)


Again, would prefer checks like this to happen at import time

if PY2:
    def makedirs(...):
        ...
else:
    def makedirs(...):
        ...

OK fair enough.

martindurant · 2017-02-05T15:44:39Z

@dscottcs , python-snappy is now available in conda-forge conda-forge/python-snappy-feedstock#5 , so the last change to .travis.yml can be reverted.

pitrou · 2017-02-07T10:31:07Z

@dscottcs, could you try merging from master, so that unrelated changes disappear from the PR? Thank you!

martindurant · 2017-02-08T15:34:35Z

Is six now a dependency?

Changes to support Python 2.7

d615b9f

- write custom 'compress' and 'decompress' functions for gzip
- use bytearray instead of memoryview for byte arrays
- default mkdir to 'mkdir -p' subprocess call

mrocklin mentioned this pull request Jan 5, 2017

Python 2 compatibility? #65

Closed

Use explicit Python2 flag instead of try/except

0cd7608

martindurant mentioned this pull request Jan 6, 2017

include gzip compress, decompress functions #63

Closed

martindurant reviewed Jan 6, 2017


dscottcs added 6 commits January 9, 2017 21:39

Simulate exist_ok = True for Python2.7 makedirs

2c6165a

Some changes to make unit tests work with Python2.7

9eea07d

Also added Python2 support to .travis.yml

Add some code omitted by mistake

c61f45d

Merge branch 'master' into fix/python2.7_support

2937efc

Include .pyx files in MANIFEST.in

8a86848

Partially disable speedups for Python2

fb0398e

Also disable Python2 speedup tests

martindurant reviewed Jan 10, 2017


mrocklin reviewed Jan 10, 2017


Martin Durant added 3 commits January 11, 2017 09:59

Unify read_col loops

5454ca4

Simplifies paths, reduces memory footprint when there are multiple pages per row-group

Park here

59ea876

Enter fix for dask@dfd025f#commitcomment-20375584

Alloz zero-row dataframes to be written

6a39181

Fixes dask#68

martindurant added 2 commits January 11, 2017 17:02

Merge pull request dask#71 from martindurant/empty_df

230a2f2

Alloz zero-row dataframes to be written

Merge pull request dask#69 from martindurant/easier_assign

5289dfb

Easier assign

martindurant mentioned this pull request Jan 16, 2017

Update dependencies #73

Closed

Martin Durant and others added 4 commits January 28, 2017 18:03

Import tobson, otherwise it's undefined.

5358727

Add extra bytes test

aea8fc5

Appropriate for definitions reading with extra junk

Merge pull request dask#77 from martindurant/bytes_allignment_fix

da0bfc3

If definitions has extra unused bytes, seek to right location

dscottcs added 3 commits February 3, 2017 20:59

Install thriftpy from conda-forge

b1e495a

Install python-snappy from conda-forge

1288499

Use unicode strings for python2 and python3 tests

ae49fa8

Use custom conda source for python-snappy

3939d52

(Probably) temporary stopgap to get Python3.6 tests to work.

martindurant reviewed Feb 3, 2017


Remove leftover cruft that never gets executed.

c4c2118

pitrou reviewed Feb 4, 2017


mrocklin reviewed Feb 4, 2017


dscottcs added 2 commits February 7, 2017 01:29

Various fixes in response to comments.

df3012c

Pull python-snappy from conda-forge

4cfa827

This was referenced Feb 9, 2017

Add ability to output Hive/Impala compatible timestamps #82

Closed

Python2 support? #21

Closed

python2.7 support #87

Merged

martindurant merged commit 4cfa827 into dask:master Feb 16, 2017

dscottcs deleted the fix/python2.7_support branch March 28, 2017 20:46
