Auto text convert #30

Merged: martindurant merged 8 commits from auto_text_convert into dask:master on Nov 22, 2016
Conversation

martindurant (Member)

Fixes #24

elif ctype == parquet_thrift.ConvertedType.TIMESTAMP_MICROS:
return pd.to_datetime(data, unit='us', box=False)
return pd.to_datetime(data, unit='us')
Member

If we know that there are no NaTs here then can we instead do the following:

pd.DatetimeIndex(data, copy=False, dtype='M8[us]')

This should run in almost no time.

Member Author

We do know that in the function above, read_col.
Actually just checked, and the special NaT value (pd.to_datetime('NaT').value) can be passed totally fine.
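
For illustration, a minimal sketch of why the sentinel works, assuming the raw column arrives as int64 microseconds (the values below are made up): the NaT sentinel is simply the minimum int64, so a zero-copy reinterpretation of the buffer keeps it as NaT.

import numpy as np
import pandas as pd

# hypothetical int64 microsecond timestamps, with one NaT sentinel mixed in
iNaT = pd.to_datetime('NaT').value          # equals np.iinfo('int64').min
data = np.array([1479513600000000, iNaT], dtype='int64')

# reinterpret the same buffer as datetime64[us] without copying;
# the sentinel comes through as NaT
out = data.view('M8[us]')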

Member

In [1]: import numpy as np

In [2]: x = np.arange(10000000)

In [3]: import pandas as pd

In [4]: %time _ = pd.to_datetime(x, unit='us')
CPU times: user 80 ms, sys: 28 ms, total: 108 ms
Wall time: 104 ms

In [5]: %time _ = pd.DatetimeIndex(x, dtype='M8[us]', copy=False)
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 992 µs

@@ -115,6 +115,7 @@ def convert(data, se):
elif ctype == parquet_thrift.ConvertedType.INTERVAL:
# for those that understand, output is month, day, ms
# maybe should convert to timedelta
# TODO: seems like a np.view should do this much faster
Member

I was curious what was going on here so I inserted a PDB statement and ran tests to trigger a situation where this code would be active with data. It looks like this code wasn't covered by the test suite.

Member Author

This is a special type I haven't seen yet. Given TIME_MILLIS and DATE, I don't know why anyone would want to use this different kind of compound interval.

Member

My understanding is that interval is used for time ranges, like "from 12:00 to 1:00pm". Not sure what the state of Pandas support is for this. cc @TomAugspurger

Member

We can always just raise NotImplementedError as well

@TomAugspurger (Member) Nov 19, 2016

Pandas doesn't have anything like a time span "12:00 to 1:00 PM", but we do have the Period type that represents the datetime span 2016-01-01T12:00:00 to 2016-01-01T13:00:00:

In [10]: pd.Series(pd.period_range('2016-01-01', freq='H', periods=6))
Out[10]:
0   2016-01-01 00:00
1   2016-01-01 01:00
2   2016-01-01 02:00
3   2016-01-01 03:00
4   2016-01-01 04:00
5   2016-01-01 05:00
dtype: object

Without having read the parquet docs, I don't know whether that, or a timedelta, or something else is more appropriate. That is a fixed frequency for the entire array.
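
For what it's worth, here is a minimal sketch of the np.view idea from the TODO above, assuming the INTERVAL values arrive as a contiguous buffer of 12-byte records (three little-endian uint32s: months, days, milliseconds, per the parquet format docs); the buffer below is constructed by hand purely for illustration.

import struct
import numpy as np

# toy buffer holding two INTERVAL values: (1 month, 2 days, 3000 ms) and (0, 10, 0)
buf = struct.pack('<3I', 1, 2, 3000) + struct.pack('<3I', 0, 10, 0)

# reinterpret the raw bytes without copying: one structured record per value
interval = np.dtype([('months', '<u4'), ('days', '<u4'), ('ms', '<u4')])
out = np.frombuffer(buf, dtype=interval)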

df2 = pf.to_pandas()
tm.assert_frame_equal(df, df2)

write(fn, df, fixed_text=False)
Member

fixed_length_text?

@martindurant (Member Author)

I think in read_col, I may have been creating the intermediate array for any time column as a float64, because they are integers, and integers cannot have NaN, and this is what pandas does (but times can have NaT). So the slow conversion was actually float64->datetime64.

@mrocklin (Member)

A trivial benchmark:

In [9]: %time fastparquet.write('notfixed.parquet', df, fixed_text=False)
CPU times: user 916 ms, sys: 24 ms, total: 940 ms
Wall time: 939 ms

In [10]: %time fastparquet.write('fixed.parquet', df, fixed_text=True)
CPU times: user 1.22 s, sys: 12 ms, total: 1.23 s
Wall time: 1.24 s

In [11]: !du -hs fixed.parquet
980K    fixed.parquet

In [12]: !du -hs notfixed.parquet
4.8M    notfixed.parquet

In [13]: %time _ = ParquetFile('fixed.parquet').to_pandas()
CPU times: user 1.32 s, sys: 0 ns, total: 1.32 s
Wall time: 1.34 s

In [14]: %time _ = ParquetFile('notfixed.parquet').to_pandas()
CPU times: user 808 ms, sys: 8 ms, total: 816 ms
Wall time: 817 ms

In this naive case it looks like fixed is slower but more compact.

@mrocklin (Member)

Fixed reading gets faster when using compression:

In [15]: %time fastparquet.write('fixed.parquet', df, fixed_text=True, compression='snappy')
CPU times: user 1.2 s, sys: 8 ms, total: 1.2 s
Wall time: 1.2 s

In [16]: %time fastparquet.write('notfixed.parquet', df, fixed_text=False, compression='snappy')
CPU times: user 920 ms, sys: 12 ms, total: 932 ms
Wall time: 938 ms

In [17]: %time _ = ParquetFile('notfixed.parquet').to_pandas()
CPU times: user 812 ms, sys: 0 ns, total: 812 ms
Wall time: 810 ms

In [18]: %time _ = ParquetFile('fixed.parquet').to_pandas()
CPU times: user 388 ms, sys: 8 ms, total: 396 ms
Wall time: 398 ms

@mrocklin (Member)

> I think in read_col, I may have been creating the intermediate array for any time column as a float64, because they are integers, and integers cannot have NaN, and this is what pandas does (but times can have NaT). So the slow conversion was actually float64->datetime64.

Should we special case non-nullable datetimes?
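
For illustration, a rough sketch of what such a special case could look like (the names here are made up, not the read_col internals): when nulls are possible the integers have to pass through float64 so that NaN can stand in for missing values, whereas a non-nullable column can be reinterpreted in place.

import numpy as np
import pandas as pd

micros = np.arange(10000000, dtype=np.int64)   # made-up microsecond data

# nullable path: go via float64 so NaN can represent NaT; this is the slow conversion
nullable = pd.to_datetime(micros.astype('float64'), unit='us')

# non-nullable path: reinterpret the int64 buffer directly, essentially free
non_nullable = micros.view('M8[us]')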

@martindurant (Member Author)

Fixed also gets much faster when there are no nulls (which is where my intuition came from). Note, though, that the input here is still object, so in the fixed case there is a conversion on write to the S dtype:

In [18]: %time fastparquet.write('unfixed.parq', x, fixed_text=False, has_nulls=False)
CPU times: user 511 ms, sys: 60.5 ms, total: 572 ms
Wall time: 573 ms

In [19]: %time fastparquet.write('fixed.parq', x, has_nulls=False)
CPU times: user 478 ms, sys: 26.2 ms, total: 505 ms
Wall time: 504 ms

In [20]: pf = fastparquet.ParquetFile('unfixed.parq')

In [21]: %time out = pf.to_pandas()
CPU times: user 403 ms, sys: 12.8 ms, total: 415 ms
Wall time: 417 ms

In [22]: pf = fastparquet.ParquetFile('fixed.parq')

In [23]: %time out = pf.to_pandas()
CPU times: user 16.7 ms, sys: 5.42 ms, total: 22.1 ms
Wall time: 21.5 ms

In [24]: ls -l *fixed.parq
-rw-r--r--  1 mdurant  staff  1000208 19 Nov 09:25 fixed.parq
-rw-r--r--  1 mdurant  staff  5000231 19 Nov 09:25 unfixed.parq

@martindurant (Member Author)

One problem is calling find_type twice: once to get the schema definition and once to convert the column. Both need to calculate the maximum string length in the data, but this logic should be skipped, and I realise it is potentially wrong if different row groups have different maximum string lengths.

@mrocklin (Member)

Thoughts on setting has_nulls='auto' by default and then checking?

In [24]: %time df.x.isnull().any()
CPU times: user 52 ms, sys: 0 ns, total: 52 ms
Wall time: 54.3 ms
Out[24]: False

(or perhaps df.x.hasnans)

It might be worth setting (not)nullable metadata even on floats so that other tools know how to handle things.
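
For illustration, a sketch of how an 'auto' check could be wired in at write time (the helper name is hypothetical); given the note further down that hasnans can miss None in object columns, the object case falls back to isnull():

def column_has_nulls(series):
    # object columns can hold None, which hasnans may not report,
    # so use the slower but thorough isnull().any() there
    if series.dtype == object:
        return bool(series.isnull().any())
    # cheap cached check for numeric/datetime columns
    return series.hasnans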

@mrocklin (Member)

> One problem is calling find_type twice: once to get the schema definition and once to convert the column. Both need to calculate the maximum string length in the data, but this logic should be skipped, and I realise it is potentially wrong if different row groups have different maximum string lengths.

Can we defer this responsibility to the user short term? I feel like in the case of fixed length strings users might have a pretty good idea of their string lengths like "always 1" or "always 4".

@martindurant (Member Author)

We have a few issues interleaved here. I suggest we regroup on Monday.

For now, I think asking the user for string length when using fixed strings is fine (another keyword?), which is sort of the state before this PR, where they had to convert to S by hand.

I see that hasnans is much faster than isnull().any(), but it doesn't actually seem to pick up a None in an object column.

@martindurant (Member Author) commented Nov 19, 2016

This is the speed-up that I was hoping to achieve - here with relatively long strings and no compression/disc access:

In [24]: x = np.random.choice([b'hello', b'oi', b'lincolnshire'], size=10000000).astype("O")

In [25]: %%time
    ...: out = b"".join([struct.pack('<i', len(s)) + s for s in x])
    ...: o = fastparquet.encoding.read_plain(out, 6, 10000000, None)
    ...:
CPU times: user 8.18 s, sys: 746 ms, total: 8.92 s
Wall time: 8.92 s

In [26]: %%time
    ...: out3 = x.astype("S").tostring()
    ...: o = fastparquet.encoding.read_plain(out3, 7, 10000000, 12)
    ...:
CPU times: user 726 ms, sys: 144 ms, total: 870 ms
Wall time: 869 ms

Note that the output of the latter is |S12, and pandas would convert that back to object as soon as you do anything to the column aside from selecting. Tagging .astype("O") on the end makes the time go up to 1.23s. We of course need to work with pandas series to handle NULLs and conversion to/from UTF8.

Interestingly, a big chunk of the first method is in the struct and adding up the parts of the bytes before the final join: b"".join(x) alone takes 0.9s only, hence my interest in the "delta binary array" encoding, where you store the set of lengths separately. I can't think of a way to numba this faster, but perhaps someone else can.
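
For illustration, a toy version of the store-the-lengths-separately layout (only a sketch of the idea, not the actual delta encoding from the parquet spec):

import numpy as np

x = np.random.choice([b'hello', b'oi', b'lincolnshire'], size=1000).astype("O")

# keep the lengths in their own int32 array instead of a 4-byte prefix before every value
lengths = np.fromiter((len(s) for s in x), dtype=np.int32, count=len(x))
payload = b"".join(x)

# decoding then reduces to slicing at the cumulative offsets
offsets = np.concatenate(([0], np.cumsum(lengths)))
decoded = [payload[a:b] for a, b in zip(offsets[:-1], offsets[1:])]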

@mrocklin (Member)

> This is the speed-up that I was hoping to achieve - here with relatively long strings and no compression/disc access

How common is the relatively long string case? I just want to make sure that we're optimizing our optimization efforts.

> Interestingly, a big chunk of the first method is in the struct and adding up the parts of the bytes before the final join: b"".join(x) alone takes 0.9s only, hence my interest in the "delta binary array" encoding, where you store the set of lengths separately. I can't think of a way to numba this faster, but perhaps someone else can.

To make sure I understand, you're finding that adding bytestrings is taking up most of the time, yes? I suppose each addition creates a new Python object. It might be nicer to pre-allocate an array of bytes and then insert the bytes into the correct locations. Not sure. I just tried this a few ways and didn't get any better results.

Also, for comparison, here's cythonized msgpack:

In [40]: %time len(pd.msgpack.dumps(x.tolist(), use_bin_type=True))
CPU times: user 608 ms, sys: 80 ms, total: 688 ms
Wall time: 693 ms
Out[40]: 83333681

@martindurant (Member Author)

On commonness of long strings: the speed-up factor is more pronounced for shorter byte-strings.

Correct, the biggest contributor is the + operation between the bytestrings within the comprehension. Unfortunately, I don't think there's a way within numba to access the raw string referenced as a python string object within an object array or list.

martindurant mentioned this pull request on Nov 21, 2016
Due to the unnecessary overhead of having to calculate the maximum string type in the schema and for the column conversion, now explicitly require the array string length in order to do object->fixed conversion.
@mrocklin (Member)

I like the fixed-length text changes that make it an explicit input from the user. The API seems intuitive to me. I have a slight preference for something more verbose, like fixed_length_text, but I can see how that is also fairly long.
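
If I am reading the final state of the branch right, the explicit form looks roughly like this (the column name and length below are made up; check the write docstring for the exact signature):

import pandas as pd
import fastparquet

df = pd.DataFrame({'mycol': [b'hello', b'oi', b'lincolnshire']})

# fixed-width byte arrays, with the per-column width supplied by the caller
fastparquet.write('fixed.parq', df, fixed_text={'mycol': 12})

# default: variable-length byte arrays, no width needed
fastparquet.write('notfixed.parq', df)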

@martindurant martindurant merged commit f8a8cea into dask:master Nov 22, 2016
@martindurant martindurant deleted the auto_text_convert branch November 22, 2016 16:11