Better pandas integration #5189
Table stores X, Y, and metas in three separate numpy arrays. DataFrame depends on numpy, just like us, so it is possible to simply wrap each of those arrays in a DataFrame. At some point we contemplated taking Table and stuffing it into the DataFrame interface. That would be nice, but a rewrite would probably be better; maybe in Orange4, when we separate server and client. This PR adds a `table_to_frames` method instead. I've added some benchmarks too.

Dense Benchmark
cols = 100
rows = 100000

[create_df] with 5 loops, best of 3:
min 96.3 usec per loop
avg 109 usec per loop
[create_orangedf] with 5 loops, best of 3:
min 238 msec per loop
avg 239 msec per loop
[create_table] with 5 loops, best of 3:
min 37.7 msec per loop
avg 38 msec per loop
[multiply_df] with 5 loops, best of 3:
min 14.6 msec per loop
avg 14.9 msec per loop
[multiply_df_numpy] with 5 loops, best of 3:
min 20.5 msec per loop
avg 21.3 msec per loop
[multiply_orangedf] with 5 loops, best of 3:
min 14.7 msec per loop
avg 14.9 msec per loop
[multiply_orangedf_numpy] with 5 loops, best of 3:
min 20.6 msec per loop
avg 21.2 msec per loop
[multiply_table] with 5 loops, best of 3:
min 20.2 msec per loop
avg 20.5 msec per loop
[normalize_df] with 5 loops, best of 3:
min 360 msec per loop
avg 362 msec per loop
[normalize_df_numpy] with 5 loops, best of 3:
min 113 msec per loop
avg 114 msec per loop
[normalize_orangedf] with 5 loops, best of 3:
min 368 msec per loop
avg 376 msec per loop
[normalize_orangedf_numpy] with 5 loops, best of 3:
min 98.7 msec per loop
avg 99.5 msec per loop
[normalize_table] with 5 loops, best of 3:
min 511 msec per loop
avg 516 msec per loop
[revert_orangedf] with 5 loops, best of 3:
min 109 msec per loop
avg 109 msec per loop

cols = 10000
rows = 100

[create_df] with 5 loops, best of 3:
min 132 usec per loop
avg 150 usec per loop
[create_orangedf] with 5 loops, best of 3:
min 2.72 msec per loop
avg 2.87 msec per loop
[create_table] with 5 loops, best of 3:
min 1.23 msec per loop
avg 1.35 msec per loop
[multiply_df] with 5 loops, best of 3:
min 775 usec per loop
avg 792 usec per loop
[multiply_df_numpy] with 5 loops, best of 3:
min 741 usec per loop
avg 763 usec per loop
[multiply_orangedf] with 5 loops, best of 3:
min 785 usec per loop
avg 798 usec per loop
[multiply_orangedf_numpy] with 5 loops, best of 3:
min 849 usec per loop
avg 866 usec per loop
[multiply_table] with 5 loops, best of 3:
min 740 usec per loop
avg 781 usec per loop
[normalize_df] with 5 loops, best of 3:
min 5.25 sec per loop
avg 5.29 sec per loop
[normalize_df_numpy] with 5 loops, best of 3:
min 8.02 msec per loop
avg 8.28 msec per loop
[normalize_orangedf] with 5 loops, best of 3:
min 5.24 sec per loop
avg 5.27 sec per loop
[normalize_orangedf_numpy] with 5 loops, best of 3:
min 6.55 msec per loop
avg 6.63 msec per loop
[normalize_table] with 5 loops, best of 3:
min 2.26 sec per loop
avg 2.27 sec per loop
[revert_orangedf] with 5 loops, best of 3:
min 114 msec per loop
avg 116 msec per loop
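The zero-copy wrapping discussed above can be illustrated with plain numpy and pandas (a minimal sketch; the array and column names are stand-ins, not Orange's actual internals):

```python
import numpy as np
import pandas as pd

# Stand-in for Table.X: one contiguous float array of attribute columns.
X = np.arange(12, dtype=float).reshape(4, 3)

# Wrapping a single-dtype 2D array in a DataFrame does not copy it;
# the frame is backed by the same buffer.
df = pd.DataFrame(X, columns=["a", "b", "c"])
print(np.shares_memory(df.to_numpy(), X))  # True

# Note: under copy-on-write semantics (the default in newer pandas),
# writing through the frame may allocate a copy instead of mutating X,
# so true two-way interchangeability needs care.
```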
Codecov Report
@@ Coverage Diff @@
## master #5189 +/- ##
==========================================
- Coverage 85.36% 85.34% -0.03%
==========================================
Files 301 301
Lines 61833 62001 +168
==========================================
+ Hits 52784 52914 +130
- Misses 9049 9087 +38
f6b0937
to
cd24814
Compare
91caf1d
to
e0ce7f2
Compare
The above benchmark is not incorrect; I've just rerun it. On arrays with a lot of rows, we perform about as well as the DataFrame (about 5× slower on normalize). On arrays with a lot of columns, we're 1000× slower. It seems that we optimize operations for rows, while pandas manages to optimize for both rows and columns. (It could also just be the way normalize is implemented in Orange; please suggest more benchmarks.) If pandas DataFrames were completely interchangeable with Orange Tables on the same piece of memory, you could instantiate a DataFrame from a Table, run the math, and turn it back into a Table. We're a couple of steps away from that, though.

    for name, s in df.items():
        if _is_discrete(s):
            discrete = s.astype('category').cat
            attrs.append(DiscreteVariable(name, discrete.categories.astype(str).tolist()))
            X.append(discrete.codes.replace(-1, np.nan).values)
        elif _is_datetime(s):
            tvar = TimeVariable(name)
            attrs.append(tvar)
            s = pd.to_datetime(s, infer_datetime_format=True)
            X.append(s.astype('str').replace('NaT', np.nan).map(tvar.parse).values)
        elif is_numeric_dtype(s):
            attrs.append(ContinuousVariable(name))
            X.append(s.values)
        else:
            metas.append(StringVariable(name))
            M.append(s.values.astype(object))

The above code creates a new numpy array for each discrete/datetime variable, but otherwise uses the original array. If we adjusted Table/Variable to be fully compatible with the DataFrame's underlying numpy array, the two could truly be used interchangeably, with practically constant overhead. I've added support for sparse Tables as well. pandas stores these column by column (COO format), so a new array is created during conversion. In Python Script, if a sparse array is passed in as data and automatic pandas conversion is chosen, a warning is shown that this is inefficient.

Sparse Benchmark
cols = 100
rows = 100000

[create_df] with 5 loops, best of 3:
min 187 msec per loop
avg 206 msec per loop
[create_orangedf] with 5 loops, best of 3:
min 434 msec per loop
avg 439 msec per loop
[create_table] with 5 loops, best of 3:
min 10.9 msec per loop
avg 11.1 msec per loop
[multiply_df] with 5 loops, best of 3:
min 28.6 msec per loop
avg 29.2 msec per loop
[multiply_df_numpy] with 5 loops, best of 3:
min 59.9 msec per loop
avg 60.7 msec per loop
[multiply_orangedf] with 5 loops, best of 3:
min 27.3 msec per loop
avg 27.7 msec per loop
[multiply_orangedf_numpy] with 5 loops, best of 3:
min 58.9 msec per loop
avg 60.6 msec per loop
[multiply_table] with 5 loops, best of 3:
min 12.8 msec per loop
avg 13.9 msec per loop
[normalize_df] with 5 loops, best of 3:
min 246 msec per loop
avg 251 msec per loop
[normalize_df_numpy] with 5 loops, best of 3:
min 141 msec per loop
avg 145 msec per loop
[normalize_orangedf] with 5 loops, best of 3:
min 239 msec per loop
avg 244 msec per loop
[normalize_orangedf_numpy] with 5 loops, best of 3:
min 146 msec per loop
avg 148 msec per loop
[normalize_table] with 5 loops, best of 3:
min 601 msec per loop
avg 617 msec per loop
[revert_orangedf] with 5 loops, best of 3:
min 343 msec per loop
avg 351 msec per loop
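As noted above, the sparse path cannot be zero-copy. A generic sketch with pandas' own sparse accessor (not the code from this PR) shows the column-by-column rebuild:

```python
import pandas as pd
from scipy import sparse

# A small random CSR matrix standing in for a sparse Table.X.
X = sparse.random(5, 3, density=0.4, format="csr", random_state=0)

# pandas stores sparse data per column, so this conversion
# rebuilds the data rather than wrapping the original matrix.
df = pd.DataFrame.sparse.from_spmatrix(X, columns=["a", "b", "c"])
print(df.dtypes.iloc[0])  # a Sparse[float64, ...] dtype per column

# Round-tripping back to scipy allocates yet another matrix,
# but the values survive intact.
X2 = df.sparse.to_coo().tocsr()
print((X != X2).nnz)  # 0: no differing entries
```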
For now I have only skimmed the code and written some petty comments. I still have not seriously tried it out or thought about the interface...
When I was looking into this today, I had problems with the setters: (1) they change the table, and (2) they are not atomic (separately changing the domain and then the array could be problematic for multithreading). Then, in discussions with @VesnaT, we saw that we already have exactly the same problems with Table's value arrays (see #5303). So now I think the proposed interface is fine (#5303 needs to be handled separately). @lanzagar, what do you think of the interface added to the Table?
Based off #5190.
Issue
At the moment, our pandas solution copies the X, Y and metas numpy arrays into one contiguous array. It also sets its column dtypes. This is inefficient, and it loses information such as global row IDs, roles, and variable info. The proposed solution is not an attempt at migration, but rather a seamless integration (both Orange and pandas operate on numpy arrays). This will be used in Python Script, but could also be leveraged by widget code.
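The cost of the current copy-based approach can be demonstrated with plain numpy (a sketch; the array shapes are illustrative, not Orange's actual data):

```python
import numpy as np

X = np.ones((1000, 10))                       # attributes
Y = np.zeros((1000, 1))                       # class column
metas = np.full((1000, 2), "", dtype=object)  # string metas

# hstack always allocates a fresh buffer, and mixing in the
# object-dtype metas degrades the whole result to dtype=object.
combined = np.hstack([X, Y, metas])
print(np.shares_memory(combined, X))  # False: everything was copied
print(combined.dtype)                 # object
```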
Description of changes
Added a new `table_to_frames` method, which wraps the three existing numpy arrays into three DataFrames. This is near-instantaneous: no copying occurs, it simply wraps the preexisting data structures. Also:
- ... in the `_o1, _o2, _o3` format
- ... sets `compute_value` to `None`

An edge case: how do you handle concatenate when one of the frames has weights and the other doesn't?
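For the weights edge case, one possible convention (purely illustrative; the PR does not prescribe this) is to treat missing weights as the default unit weight:

```python
import pandas as pd

a = pd.DataFrame({"x": [1.0, 2.0], "w": [0.5, 2.0]})  # frame with weights
b = pd.DataFrame({"x": [3.0, 4.0]})                   # frame without

# pd.concat aligns on columns and fills the missing 'w' with NaN,
# which we then interpret as a weight of 1.
out = pd.concat([a, b], ignore_index=True)
out["w"] = out["w"].fillna(1.0)
print(out["w"].tolist())  # [0.5, 2.0, 1.0, 1.0]
```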
Includes