Auto text convert #30

Merged: martindurant merged 8 commits from auto_text_convert into dask:master on Nov 22, 2016
Conversation

martindurant (Member)

Fixes #24

elif ctype == parquet_thrift.ConvertedType.TIMESTAMP_MICROS:
return pd.to_datetime(data, unit='us', box=False)
return pd.to_datetime(data, unit='us')
Member

If we know that there are no NaTs here then can we instead do the following:

pd.DatetimeIndex(data, copy=False, dtype='M8[us]')

This should run in almost no time.

Member Author

We do know that in the function above, read_col.
Actually just checked, and the special NaT value (pd.to_datetime('NaT').value) can be passed totally fine.
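
For illustration, a minimal sketch of why the sentinel works, assuming the raw column arrives as int64 microseconds (the values below are made up): the NaT sentinel is simply the minimum int64, so a zero-copy reinterpretation of the buffer keeps it as NaT.

import numpy as np
import pandas as pd

# hypothetical int64 microsecond timestamps, with one NaT sentinel mixed in
iNaT = pd.to_datetime('NaT').value          # equals np.iinfo('int64').min
data = np.array([1479513600000000, iNaT], dtype='int64')

# reinterpret the same buffer as datetime64[us] without copying;
# the sentinel comes through as NaT
out = data.view('M8[us]')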

Member

In [1]: import numpy as np

In [2]: x = np.arange(10000000)

In [3]: import pandas as pd

In [4]: %time _ = pd.to_datetime(x, unit='us')
CPU times: user 80 ms, sys: 28 ms, total: 108 ms
Wall time: 104 ms

In [5]: %time _ = pd.DatetimeIndex(x, dtype='M8[us]', copy=False)
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 992 µs

@@ -115,6 +115,7 @@ def convert(data, se):
elif ctype == parquet_thrift.ConvertedType.INTERVAL:
# for those that understand, output is month, day, ms
# maybe should convert to timedelta
# TODO: seems like a np.view should do this much faster
Member

I was curious what was going on here so I inserted a PDB statement and ran tests to trigger a situation where this code would be active with data. It looks like this code wasn't covered by the test suite.

Member Author

This is a special type I haven't seen yet. Given TIME_MILLIS and DATE, I don't know why anyone would want to use this different kind of compound interval.

Member

My understanding is that interval is used for time ranges, like "from 12:00 to 1:00pm". Not sure what the state of Pandas support is for this. cc @TomAugspurger

Member

We can always just raise NotImplementedError as well

@TomAugspurger (Member) Nov 19, 2016

Pandas doesn't have anything like a time span "12:00 to 1:00 PM", but we do have the Period type that represents the datetime span 2016-01-01T12:00:00 to 2016-01-01T13:00:00:

In [10]: pd.Series(pd.period_range('2016-01-01', freq='H', periods=6))
Out[10]:
0   2016-01-01 00:00
1   2016-01-01 01:00
2   2016-01-01 02:00
3   2016-01-01 03:00
4   2016-01-01 04:00
5   2016-01-01 05:00
dtype: object

Without having read the parquet docs, I don't know whether that, or a timedelta, or something else is more appropriate. That is a fixed frequency for the entire array.
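
For what it's worth, here is a minimal sketch of the np.view idea from the TODO above, assuming the INTERVAL values arrive as a contiguous buffer of 12-byte records (three little-endian uint32s: months, days, milliseconds, per the parquet format docs); the buffer below is constructed by hand purely for illustration.

import struct
import numpy as np

# toy buffer holding two INTERVAL values: (1 month, 2 days, 3000 ms) and (0, 10, 0)
buf = struct.pack('<3I', 1, 2, 3000) + struct.pack('<3I', 0, 10, 0)

# reinterpret the raw bytes without copying: one structured record per value
interval = np.dtype([('months', '<u4'), ('days', '<u4'), ('ms', '<u4')])
out = np.frombuffer(buf, dtype=interval)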

df2 = pf.to_pandas()
tm.assert_frame_equal(df, df2)

write(fn, df, fixed_text=False)
Member

fixed_length_text?

@martindurant (Member Author)

I think in read_col, I may have been creating the intermediate array for any time column as a float64, because they are integers, and integers cannot have NaN, and this is what pandas does (but times can have NaT). So the slow conversion was actually float64->datetime64.

@mrocklin (Member)

A trivial benchmark:

In [9]: %time fastparquet.write('notfixed.parquet', df, fixed_text=False)
CPU times: user 916 ms, sys: 24 ms, total: 940 ms
Wall time: 939 ms

In [10]: %time fastparquet.write('fixed.parquet', df, fixed_text=True)
CPU times: user 1.22 s, sys: 12 ms, total: 1.23 s
Wall time: 1.24 s

In [11]: !du -hs fixed.parquet
980K    fixed.parquet

In [12]: !du -hs notfixed.parquet
4.8M    notfixed.parquet

In [13]: %time _ = ParquetFile('fixed.parquet').to_pandas()
CPU times: user 1.32 s, sys: 0 ns, total: 1.32 s
Wall time: 1.34 s

In [14]: %time _ = ParquetFile('notfixed.parquet').to_pandas()
CPU times: user 808 ms, sys: 8 ms, total: 816 ms
Wall time: 817 ms

In this naive case it looks like fixed is slower but more compact.

@mrocklin (Member)

Fixed reading gets faster when using compression:

In [15]: %time fastparquet.write('fixed.parquet', df, fixed_text=True, compression='snappy')
CPU times: user 1.2 s, sys: 8 ms, total: 1.2 s
Wall time: 1.2 s

In [16]: %time fastparquet.write('notfixed.parquet', df, fixed_text=False, compression='snappy')
CPU times: user 920 ms, sys: 12 ms, total: 932 ms
Wall time: 938 ms

In [17]: %time _ = ParquetFile('notfixed.parquet').to_pandas()
CPU times: user 812 ms, sys: 0 ns, total: 812 ms
Wall time: 810 ms

In [18]: %time _ = ParquetFile('fixed.parquet').to_pandas()
CPU times: user 388 ms, sys: 8 ms, total: 396 ms
Wall time: 398 ms

@mrocklin (Member)

> I think in read_col, I may have been creating the intermediate array for any time column as a float64, because they are integers, and integers cannot have NaN, and this is what pandas does (but times can have NaT). So the slow conversion was actually float64->datetime64.

Should we special case non-nullable datetimes?
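
For illustration, a rough sketch of what such a special case could look like (the names here are made up, not the read_col internals): when nulls are possible the integers have to pass through float64 so that NaN can stand in for missing values, whereas a non-nullable column can be reinterpreted in place.

import numpy as np
import pandas as pd

micros = np.arange(10000000, dtype=np.int64)   # made-up microsecond data

# nullable path: go via float64 so NaN can represent NaT; this is the slow conversion
nullable = pd.to_datetime(micros.astype('float64'), unit='us')

# non-nullable path: reinterpret the int64 buffer directly, essentially free
non_nullable = micros.view('M8[us]')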

@martindurant (Member Author)

Fixed also gets much faster when there are no nulls (which is where my intuition came from). Note, though, that the input here is still object, so in the fixed case there is a conversion on write to the S dtype:

In [18]: %time fastparquet.write('unfixed.parq', x, fixed_text=False, has_nulls=False)
CPU times: user 511 ms, sys: 60.5 ms, total: 572 ms
Wall time: 573 ms

In [19]: %time fastparquet.write('fixed.parq', x, has_nulls=False)
CPU times: user 478 ms, sys: 26.2 ms, total: 505 ms
Wall time: 504 ms

In [20]: pf = fastparquet.ParquetFile('unfixed.parq')

In [21]: %time out = pf.to_pandas()
CPU times: user 403 ms, sys: 12.8 ms, total: 415 ms
Wall time: 417 ms

In [22]: pf = fastparquet.ParquetFile('fixed.parq')

In [23]: %time out = pf.to_pandas()
CPU times: user 16.7 ms, sys: 5.42 ms, total: 22.1 ms
Wall time: 21.5 ms

In [24]: ls -l *fixed.parq
-rw-r--r--  1 mdurant  staff  1000208 19 Nov 09:25 fixed.parq
-rw-r--r--  1 mdurant  staff  5000231 19 Nov 09:25 unfixed.parq

@martindurant (Member Author)

One problem is calling find_type twice: once to get the schema definition and once to convert the column. Both need to calculate the maximum string length in the data, but this logic should be skipped, and I realise it is potentially wrong if different row groups have different maximum string lengths.

@mrocklin (Member)

Thoughts on setting has_nulls='auto' by default and then checking?

In [24]: %time df.x.isnull().any()
CPU times: user 52 ms, sys: 0 ns, total: 52 ms
Wall time: 54.3 ms
Out[24]: False

(or perhaps df.x.hasnans)

It might be worth setting (not)nullable metadata even on floats so that other tools know how to handle things.
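
For illustration, a sketch of how an 'auto' check could be wired in at write time (the helper name is hypothetical); given the note further down that hasnans can miss None in object columns, the object case falls back to isnull():

def column_has_nulls(series):
    # object columns can hold None, which hasnans may not report,
    # so use the slower but thorough isnull().any() there
    if series.dtype == object:
        return bool(series.isnull().any())
    # cheap cached check for numeric/datetime columns
    return series.hasnans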

@mrocklin (Member)

> One problem is calling find_type twice: once to get the schema definition and once to convert the column. Both need to calculate the maximum string length in the data, but this logic should be skipped, and I realise it is potentially wrong if different row groups have different maximum string lengths.

Can we defer this responsibility to the user short term? I feel like in the case of fixed length strings users might have a pretty good idea of their string lengths like "always 1" or "always 4".

@martindurant (Member Author)

We have a few issues interleaved here. I suggest we regroup on Monday.

For now, I think asking the user for string length when using fixed strings is fine (another keyword?), which is sort of the state before this PR, where they had to convert to S by hand.

I see that hasnans is much faster than isnull().any(), but it doesn't actually seem to pick up a None in an object column.

@martindurant (Member Author) commented Nov 19, 2016

This is the speed-up that I was hoping to achieve - here with relatively long strings and no compression/disc access:

In [24]: x = np.random.choice([b'hello', b'oi', b'lincolnshire'], size=10000000).astype("O")

In [25]: %%time
    ...: out = b"".join([struct.pack('<i', len(s)) + s for s in x])
    ...: o = fastparquet.encoding.read_plain(out, 6, 10000000, None)
    ...:
CPU times: user 8.18 s, sys: 746 ms, total: 8.92 s
Wall time: 8.92 s

In [26]: %%time
    ...: out3 = x.astype("S").tostring()
    ...: o = fastparquet.encoding.read_plain(out3, 7, 10000000, 12)
    ...:
CPU times: user 726 ms, sys: 144 ms, total: 870 ms
Wall time: 869 ms

Note that the output of the latter is |S12, and pandas would convert that back to object as soon as you do anything to the column aside from selecting. Tagging .astype("O") on the end makes the time go up to 1.23s. We of course need to work with pandas series to handle NULLs and conversion to/from UTF8.

Interestingly, a big chunk of the first method is in the struct and adding up the parts of the bytes before the final join: b"".join(x) alone takes 0.9s only, hence my interest in the "delta binary array" encoding, where you store the set of lengths separately. I can't think of a way to numba this faster, but perhaps someone else can.
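
For illustration, a toy version of the store-the-lengths-separately layout (only a sketch of the idea, not the actual delta encoding from the parquet spec):

import numpy as np

x = np.random.choice([b'hello', b'oi', b'lincolnshire'], size=1000).astype("O")

# keep the lengths in their own int32 array instead of a 4-byte prefix before every value
lengths = np.fromiter((len(s) for s in x), dtype=np.int32, count=len(x))
payload = b"".join(x)

# decoding then reduces to slicing at the cumulative offsets
offsets = np.concatenate(([0], np.cumsum(lengths)))
decoded = [payload[a:b] for a, b in zip(offsets[:-1], offsets[1:])]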

@mrocklin (Member)

> This is the speed-up that I was hoping to achieve - here with relatively long strings and no compression/disc access

How common is the relatively long string case? I just want to make sure that we're optimizing our optimization efforts.

> Interestingly, a big chunk of the first method is in the struct and adding up the parts of the bytes before the final join: b"".join(x) alone takes 0.9s only, hence my interest in the "delta binary array" encoding, where you store the set of lengths separately. I can't think of a way to numba this faster, but perhaps someone else can.

To make sure I understand, you're finding that adding bytestrings is taking up most of the time, yes? I suppose each addition creates a new Python object. It might be nicer to pre-allocate an array of bytes and then insert the bytes into the correct locations. Not sure. I just tried this a few ways and didn't get any better results.

Also, for comparison, here's cythonized msgpack:

In [40]: %time len(pd.msgpack.dumps(x.tolist(), use_bin_type=True))
CPU times: user 608 ms, sys: 80 ms, total: 688 ms
Wall time: 693 ms
Out[40]: 83333681

@martindurant (Member Author)

On commonness of long strings: the speed-up factor is more pronounced for shorter byte-strings.

Correct, the biggest contributor is the + operation between the bytestrings within the comprehension. Unfortunately, I don't think there's a way within numba to access the raw string referenced as a python string object within an object array or list.

martindurant mentioned this pull request on Nov 21, 2016
Due to the unnecessary overhead of having to calculate the maximum string type in the schema and for the column conversion, now explicitly require the array string length in order to do object->fixed conversion.
@mrocklin (Member)

I like the fixed-length text changes that make it an explicit input from the user. The API seems intuitive to me. I have a slight preference for something more verbose, like fixed_length_text, but I can see how that is also fairly long.
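
If I am reading the final state of the branch right, the explicit form looks roughly like this (the column name and length below are made up; check the write docstring for the exact signature):

import pandas as pd
import fastparquet

df = pd.DataFrame({'mycol': [b'hello', b'oi', b'lincolnshire']})

# fixed-width byte arrays, with the per-column width supplied by the caller
fastparquet.write('fixed.parq', df, fixed_text={'mycol': 12})

# default: variable-length byte arrays, no width needed
fastparquet.write('notfixed.parq', df)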

@martindurant martindurant merged commit f8a8cea into dask:master Nov 22, 2016
@martindurant martindurant deleted the auto_text_convert branch November 22, 2016 16:11