Auto text convert #30
Fixed a number of tests and inconsistencies along the way.
    elif ctype == parquet_thrift.ConvertedType.TIMESTAMP_MICROS:
-        return pd.to_datetime(data, unit='us', box=False)
+        return pd.to_datetime(data, unit='us')
If we know that there are no NaTs here then can we instead do the following:
pd.DatetimeIndex(data, copy=False, dtype='M8[us]')
This should run in almost no time.
We do know that in the function above, read_col.
Actually, I just checked, and the special NaT value (pd.to_datetime('NaT').value) can be passed through totally fine.
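As a quick sanity check, here is a minimal sketch (mine, not from the thread) showing the NaT sentinel surviving the constructor suggested above, assuming a pandas that accepts dtype='M8[us]' there (as in the timings below):

import numpy as np
import pandas as pd

nat = pd.to_datetime('NaT').value        # iNaT, the special int64 sentinel
data = np.array([0, nat, 1000000], dtype='int64')
idx = pd.DatetimeIndex(data, dtype='M8[us]', copy=False)
print(idx)                               # the middle element comes out as NaT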
In [1]: import numpy as np
In [2]: x = np.arange(10000000)
In [3]: import pandas as pd
In [4]: %time _ = pd.to_datetime(x, unit='us')
CPU times: user 80 ms, sys: 28 ms, total: 108 ms
Wall time: 104 ms
In [5]: %time _ = pd.DatetimeIndex(x, dtype='M8[us]', copy=False)
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 992 µs
@@ -115,6 +115,7 @@ def convert(data, se):
     elif ctype == parquet_thrift.ConvertedType.INTERVAL:
         # for those that understand, output is month, day, ms
         # maybe should convert to timedelta
+        # TODO: seems like a np.view should do this much faster
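For what it's worth, a minimal sketch of the np.view idea in that TODO (mine, not the PR's code), assuming the parquet spec's INTERVAL layout of three little-endian uint32 fields per value (months, days, milliseconds):

import numpy as np

# two sample INTERVAL values, each 12 bytes: uint32 months, days, millis
buf = (np.array([1, 2, 1000], dtype='<u4').tobytes() +
       np.array([0, 7, 500], dtype='<u4').tobytes())
parts = np.frombuffer(buf, dtype='<u4').reshape(-1, 3)
months, days, millis = parts.T
# days and millis can become a timedelta64; months cannot, since a month
# has no fixed duration
td = days.astype('m8[D]') + millis.astype('m8[ms]')
print(months, td)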
I was curious what was going on here so I inserted a PDB statement and ran tests to trigger a situation where this code would be active with data. It looks like this code wasn't covered by the test suite.
This is a special type I haven't seen yet. Given TIME_MILLIS and DATE, I don't know why anyone would want to use this different kind of compound interval.
My understanding is that interval is used for time ranges, like "from 12:00 to 1:00pm". Not sure what the state of Pandas support is for this. cc @TomAugspurger
We can always just raise NotImplementedError as well.
Pandas doesn't have anything like a time span "12:00 to 1:00 PM", but we do have the Period type that represents the datetime span 2016-01-01T12:00:00 to 2016-01-01T13:00:00:
In [10]: pd.Series(pd.period_range('2016-01-01', freq='H', periods=6))
Out[10]:
0 2016-01-01 00:00
1 2016-01-01 01:00
2 2016-01-01 02:00
3 2016-01-01 03:00
4 2016-01-01 04:00
5 2016-01-01 05:00
dtype: object
Without having read the parquet docs, I don't know whether that, or a timedelta, or something else is more appropriate. That is a fixed frequency for the entire array.
df2 = pf.to_pandas()
tm.assert_frame_equal(df, df2)

write(fn, df, fixed_text=False)
fixed_length_text?
I think that in read_col I may have been creating the intermediate array for any time column as float64, because the values are integers, integers cannot hold NaN, and this is what pandas does (but times can have NaT). So the slow conversion was actually float64->datetime64.
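To illustrate the difference (a sketch of mine, not the PR's code): an int64 buffer can be reinterpreted as datetime64 with a zero-copy view, while a float64 intermediate forces a real element-by-element cast:

import numpy as np

x = np.arange(10, dtype='int64')   # microseconds since the epoch
fast = x.view('M8[us]')            # zero-copy reinterpretation of int64
xf = x.astype('float64')           # the NaN-capable intermediate
slow = xf.astype('M8[us]')         # every element must be converted back
assert (fast == slow).all()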
A trivial benchmark:

In [9]: %time fastparquet.write('notfixed.parquet', df, fixed_text=False)
CPU times: user 916 ms, sys: 24 ms, total: 940 ms
Wall time: 939 ms
In [10]: %time fastparquet.write('fixed.parquet', df, fixed_text=True)
CPU times: user 1.22 s, sys: 12 ms, total: 1.23 s
Wall time: 1.24 s
In [11]: !du -hs fixed.parquet
980K fixed.parquet
In [12]: !du -hs notfixed.parquet
4.8M notfixed.parquet
In [13]: %time _ = ParquetFile('fixed.parquet').to_pandas()
CPU times: user 1.32 s, sys: 0 ns, total: 1.32 s
Wall time: 1.34 s
In [14]: %time _ = ParquetFile('notfixed.parquet').to_pandas()
CPU times: user 808 ms, sys: 8 ms, total: 816 ms
Wall time: 817 ms

In this naive case it looks like fixed is slower but more compact.
Fixed reading gets faster when using compression:

In [15]: %time fastparquet.write('fixed.parquet', df, fixed_text=True, compression='snappy')
CPU times: user 1.2 s, sys: 8 ms, total: 1.2 s
Wall time: 1.2 s
In [16]: %time fastparquet.write('notfixed.parquet', df, fixed_text=False, compression='snappy')
CPU times: user 920 ms, sys: 12 ms, total: 932 ms
Wall time: 938 ms
In [17]: %time _ = ParquetFile('notfixed.parquet').to_pandas()
CPU times: user 812 ms, sys: 0 ns, total: 812 ms
Wall time: 810 ms
In [18]: %time _ = ParquetFile('fixed.parquet').to_pandas()
CPU times: user 388 ms, sys: 8 ms, total: 396 ms
Wall time: 398 ms
Should we special-case non-nullable datetimes?
Fixed also gets much faster when there are no nulls (which is where my intuition came from). Note, though, that the input here is still object, so on write there is still a conversion to S type in the fixed case.
One problem is that find_type is called twice: once to get the schema definition and once to convert the column, and both calls calculate the maximum string length in the data. That duplicated work should be skipped, and I realise the logic is potentially wrong anyway if different row groups have different maximum string lengths.
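A sketch of the obvious fix (names are illustrative, not the PR's actual helpers, and it assumes ascii-width strings): compute the width once and hand the same number to both consumers:

import numpy as np

def fixed_width(col):
    # hypothetical helper: compute the maximum length exactly once
    return max(len(s) for s in col)

col = np.array(['a', 'bcd', 'xy'], dtype=object)
width = fixed_width(col)                # used for the schema definition...
converted = col.astype('S%d' % width)   # ...and reused for the conversion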
Thoughts on setting nullability based on a check like this?

In [24]: %time df.x.isnull().any()
CPU times: user 52 ms, sys: 0 ns, total: 52 ms
Wall time: 54.3 ms
Out[24]: False

(or perhaps hasnans)

It might be worth setting (not)nullable metadata even on floats so that other tools know how to handle things.
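For reference, the two candidate checks side by side (my sketch; the hasnans behaviour noted in the comment is the observation from the reply below, not something asserted for every pandas version):

import pandas as pd

s = pd.Series(['a', None, 'b'], dtype=object)
# per the reply below, hasnans was observed not to pick up a None in an
# object column, while isnull().any() does (at the cost of a full scan)
print(s.isnull().any())
print(s.hasnans)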
Can we defer this responsibility to the user in the short term? I feel like, in the case of fixed-length strings, users might have a pretty good idea of their string lengths, like "always 1" or "always 4".
We have a few issues interleaved here; I suggest we regroup on Monday. For now, I think asking the user for the string length when using fixed strings is fine (another keyword?), which is roughly the state before this PR, where they had to convert to S by hand. I see that hasnans is much faster than isnull().any(), but it doesn't actually seem to pick up a None in an object column.
This is the speed-up that I was hoping to achieve - here with relatively long strings and no compression/disc access:
Note that the output of the latter is |S12, and pandas would convert that back to object as soon as you do anything to the column aside from selecting. Interestingly, a big chunk of the first method is spent in the struct packing and in adding up the parts of the bytes before the final join:
How common is the relatively long string case? I just want to make sure that we're optimizing our optimization efforts.

To make sure I understand: you're finding that adding bytestrings is taking up most of the time, yes? I suppose each addition creates a new Python object. It might be nicer to pre-allocate an array of bytes and then insert the bytes into the correct locations. Not sure. I just tried this a few ways and didn't get any better results.

Also, for comparison, here's cythonized msgpack:

In [40]: %time len(pd.msgpack.dumps(x.tolist(), use_bin_type=True))
CPU times: user 608 ms, sys: 80 ms, total: 688 ms
Wall time: 693 ms
Out[40]: 83333681
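A sketch of that pre-allocation idea (mine, assuming parquet's plain BYTE_ARRAY layout of a 4-byte little-endian length prefix per value): write everything into one buffer instead of summing bytes objects:

import struct
import numpy as np

values = [b'abc', b'de', b'fghi']
total = sum(4 + len(v) for v in values)      # 4-byte length prefix each
out = np.empty(total, dtype='uint8')
pos = 0
for v in values:
    out[pos:pos + 4] = np.frombuffer(struct.pack('<I', len(v)), dtype='uint8')
    pos += 4
    out[pos:pos + len(v)] = np.frombuffer(v, dtype='uint8')
    pos += len(v)
encoded = out.tobytes()                      # one buffer, no bytes additions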
On the commonness of long strings: the speed-up factor is more pronounced for shorter byte-strings. Correct, the biggest contributor is the + operation between the bytestrings within the comprehension. Unfortunately, I don't think there's a way within numba to access the raw bytes of a python string object held in an object array or list.
Due to the unnecessary overhead of having to calculate the maximum string length both for the schema and for the column conversion, now explicitly require the string length in order to do the object->fixed conversion.
I like the fixed-length text changes that make it an explicit input from the user. The API seems intuitive to me. I have a slight preference for something more verbose, like …
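For illustration, usage might look something like this (the dict-of-widths keyword shape is my assumption about where the API landed, not necessarily what this PR merged):

import pandas as pd
import fastparquet

df = pd.DataFrame({'x': ['abc', 'defgh', 'ij']})
# explicitly supply the fixed width per column instead of having the
# library scan the data for the maximum length
fastparquet.write('out.parquet', df, fixed_text={'x': 12})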
Fixes #24