Better pandas integration #5189

Merged: 3 commits merged into biolab:master from pandas-support on Mar 5, 2021

Conversation

@irgolic (Member) commented on Jan 17, 2021

Based on #5190.

Issue

At the moment, our pandas solution copies the X, Y, and metas numpy arrays into one contiguous array and sets its column dtypes. This is inefficient, and it loses information such as global row IDs, column roles, and variable info.

The proposed solution is not an attempt at migration, but rather a seamless integration: both Orange and pandas operate on numpy arrays, so wrapping beats copying (see the sketch below). This will be used in Python Script, but could also be leveraged by widget code.
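
A minimal illustration of the wrapping-vs-copying point, assuming default pandas settings (constructing a DataFrame from a single homogeneous 2-D array does not copy):

import numpy as np
import pandas as pd

X = np.random.rand(100000, 100)
df = pd.DataFrame(X)                    # wraps the array in one block, no copy
print(np.shares_memory(df.values, X))   # True: both point at the same buffer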

Description of changes

Added a new table_to_frames method, which wraps the three existing numpy arrays into three DataFrames. This is near-instantaneous: no copying occurs, it simply wraps the preexisting data structures. Also:

  • it keeps global row index information by setting pandas indices in the _o1, _o2, _o3 format (see the sketch after this list)
  • upon conversion back to a Table, it copies the columns' original variables (setting compute_value to None)
  • it remembers which role the converted table represented (feature/target/meta), and applies that role to all its columns upon conversion back to a Table
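
A small sketch of the index scheme (illustrative only; the row ids here are hypothetical):

import numpy as np
import pandas as pd

row_ids = [1, 2, 3]                      # hypothetical global row ids from a Table
df = pd.DataFrame(np.random.rand(3, 2), columns=["a", "b"],
                  index=[f"_o{i}" for i in row_ids])
print(list(df.index))                    # ['_o1', '_o2', '_o3']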

An edge case: how should concatenation be handled when one of the frames has weights and the other doesn't?

Includes
  • Code changes
  • Tests
  • Documentation

@irgolic irgolic marked this pull request as draft on January 17, 2021 15:23
@irgolic irgolic force-pushed the pandas-support branch 3 times, most recently from 095ab5b to f6b0937 on January 17, 2021 20:11
@irgolic (Member, Author) commented on Jan 17, 2021

Table stores X, Y, and metas in three separate numpy arrays. DataFrame depends on numpy, just like us, so it's possible to simply wrap each of these arrays in a DataFrame.

Orange provides table_from_frame and table_to_frame functions. When generating a DataFrame from a Table, all three numpy arrays (X, Y, metas) are copied to a contiguous location, and the DataFrame's dtypes are set. When converting back, variable types are re-inferred and the global row index is lost. This creates a fully-fledged standalone DataFrame with no Orange footprint.
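
Typical usage of these existing functions looks roughly like this (a sketch; the module path Orange.data.pandas_compat is assumed):

from Orange.data import Table
from Orange.data.pandas_compat import table_from_frame, table_to_frame  # assumed location

table = Table("iris")
df = table_to_frame(table)      # full copy; standalone DataFrame with dtypes set
table2 = table_from_frame(df)   # variable types re-inferred, global row index lost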

At some point, we contemplated taking Table and stuffing it into the DataFrame interface. This would be nice, but a rewrite would probably be better. Maybe in Orange4, when we separate server and client.

This PR adds a table_to_frames function, which returns three DataFrames: X, Y, and metas. The DataFrames' indices are set in the _o1, _o2, _o3 format. Each frame's metadata keeps the original Orange variables (associated by column name) and the frame's role (attribute/target/meta). When converting back to an Orange Table, global row indices are kept, and the domain is constructed by copying the original columns' variables with compute_value=None (variable types are guessed for new columns).
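
A sketch of the forward conversion (names taken from the PR description; the exact API may differ):

from Orange.data import Table

table = Table("iris")
# method name from the PR description; exact signature assumed
df_x, df_y, df_m = table.table_to_frames()
print(df_x.index[:3])   # _o-style labels carrying the global row ids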

I've added some benchmarks too, wherein table is an Orange Table, df is a natural DataFrame, and orangedf is a DataFrame converted from an Orange Table. Normalization is tested with both pandas vectorization and numpy vectorization (https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6). They're all one-liners; check the code if benchmark names are unclear. The two normalize variants are sketched below.
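
For reference, the normalize one-liners as reconstructed from the description (assumed to match the benchmark code):

normalized = (df - df.mean()) / df.std()              # pandas vectorization
v = df.values
normalized_np = (v - v.mean(axis=0)) / v.std(axis=0)  # numpy vectorization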

Dense Benchmark

cols = 100
rows = 100000
[create_df] with 5 loops, best of 3:
	min 96.3 usec per loop
	avg 109 usec per loop
[create_orangedf] with 5 loops, best of 3:
	min 238 msec per loop
	avg 239 msec per loop
[create_table] with 5 loops, best of 3:
	min 37.7 msec per loop
	avg 38 msec per loop

[multiply_df] with 5 loops, best of 3:
	min 14.6 msec per loop
	avg 14.9 msec per loop
[multiply_df_numpy] with 5 loops, best of 3:
	min 20.5 msec per loop
	avg 21.3 msec per loop
[multiply_orangedf] with 5 loops, best of 3:
	min 14.7 msec per loop
	avg 14.9 msec per loop
[multiply_orangedf_numpy] with 5 loops, best of 3:
	min 20.6 msec per loop
	avg 21.2 msec per loop
[multiply_table] with 5 loops, best of 3:
	min 20.2 msec per loop
	avg 20.5 msec per loop

[normalize_df] with 5 loops, best of 3:
	min 360 msec per loop
	avg 362 msec per loop
[normalize_df_numpy] with 5 loops, best of 3:
	min 113 msec per loop
	avg 114 msec per loop
[normalize_orangedf] with 5 loops, best of 3:
	min 368 msec per loop
	avg 376 msec per loop
[normalize_orangedf_numpy] with 5 loops, best of 3:
	min 98.7 msec per loop
	avg 99.5 msec per loop
[normalize_table] with 5 loops, best of 3:
	min 511 msec per loop
	avg 516 msec per loop

[revert_orangedf] with 5 loops, best of 3:
	min 109 msec per loop
	avg 109 msec per loop

cols = 10000
rows = 100
[create_df] with 5 loops, best of 3:
	min 132 usec per loop
	avg 150 usec per loop
[create_orangedf] with 5 loops, best of 3:
	min 2.72 msec per loop
	avg 2.87 msec per loop
[create_table] with 5 loops, best of 3:
	min 1.23 msec per loop
	avg 1.35 msec per loop

[multiply_df] with 5 loops, best of 3:
	min 775 usec per loop
	avg 792 usec per loop
[multiply_df_numpy] with 5 loops, best of 3:
	min 741 usec per loop
	avg 763 usec per loop
[multiply_orangedf] with 5 loops, best of 3:
	min 785 usec per loop
	avg 798 usec per loop
[multiply_orangedf_numpy] with 5 loops, best of 3:
	min 849 usec per loop
	avg 866 usec per loop
[multiply_table] with 5 loops, best of 3:
	min 740 usec per loop
	avg 781 usec per loop

[normalize_df] with 5 loops, best of 3:
	min 5.25 sec per loop
	avg 5.29 sec per loop
[normalize_df_numpy] with 5 loops, best of 3:
	min 8.02 msec per loop
	avg 8.28 msec per loop
[normalize_orangedf] with 5 loops, best of 3:
	min 5.24 sec per loop
	avg 5.27 sec per loop
[normalize_orangedf_numpy] with 5 loops, best of 3:
	min 6.55 msec per loop
	avg 6.63 msec per loop
[normalize_table] with 5 loops, best of 3:
	min 2.26 sec per loop
	avg 2.27 sec per loop

[revert_orangedf] with 5 loops, best of 3:
	min 114 msec per loop
	avg 116 msec per loop

@codecov (bot) commented on Jan 17, 2021

Codecov Report

Merging #5189 (b9f53ba) into master (31ba74a) will decrease coverage by 0.02%.
The diff coverage is 87.50%.

@@            Coverage Diff             @@
##           master    #5189      +/-   ##
==========================================
- Coverage   85.36%   85.34%   -0.03%     
==========================================
  Files         301      301              
  Lines       61833    62001     +168     
==========================================
+ Hits        52784    52914     +130     
- Misses       9049     9087      +38     

@irgolic irgolic marked this pull request as ready for review on January 17, 2021 21:53
@irgolic irgolic force-pushed the pandas-support branch 3 times, most recently from 91caf1d to e0ce7f2 on January 18, 2021 21:26
@irgolic (Member, Author) commented on Jan 18, 2021

The above benchmark is not incorrect; I've rerun it to confirm. On arrays with a lot of rows, we perform roughly on par with the DataFrame (though about 5× slower on normalize). On arrays with a lot of columns, we're 1000× slower. It seems that we optimize operations for rows, while pandas manages to optimize for both rows and columns. (It could also just be the way normalize is implemented in Orange; please suggest more benchmarks.)

If pandas DataFrames were completely interchangeable with Orange Tables on the same piece of memory, you could instantiate a DataFrame from a Table, run the math, and turn it back into a Table. We're a couple of steps away from that, though. compute_value is set to None during DataFrame -> Table conversion (is this a problem?). More importantly, new df columns (those that did not come from a Table -> DataFrame conversion) must sometimes be preprocessed, which creates a whole new numpy array.
I assume they must be preprocessed because of the way table_from_frame was already implemented:

# Excerpt from table_from_frame (imports added here for context; _is_discrete
# and _is_datetime are module-internal helpers, and attrs, metas, X, M are
# accumulator lists built up earlier in the function).
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
from Orange.data import (ContinuousVariable, DiscreteVariable,
                         StringVariable, TimeVariable)

for name, s in df.items():
    if _is_discrete(s):
        # categorical column: category codes are materialized into a new array
        discrete = s.astype('category').cat
        attrs.append(DiscreteVariable(name, discrete.categories.astype(str).tolist()))
        X.append(discrete.codes.replace(-1, np.nan).values)
    elif _is_datetime(s):
        # datetime column: values are parsed one by one into a new array
        tvar = TimeVariable(name)
        attrs.append(tvar)
        s = pd.to_datetime(s, infer_datetime_format=True)
        X.append(s.astype('str').replace('NaT', np.nan).map(tvar.parse).values)
    elif is_numeric_dtype(s):
        # numeric column: the original buffer is reused as-is
        attrs.append(ContinuousVariable(name))
        X.append(s.values)
    else:
        # everything else becomes a string meta column
        metas.append(StringVariable(name))
        M.append(s.values.astype(object))

The above code creates a new numpy array for each discrete/datetime column, but otherwise reuses the original array. If we were to adjust Table/Variable to be fully compatible with a DataFrame's underlying numpy array, the two could truly be used interchangeably, with an overhead of practically constant complexity.
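
The buffer-reuse claim for numeric columns can be checked directly (a minimal sketch, assuming default pandas copy behaviour):

import numpy as np
import pandas as pd

arr = np.arange(5, dtype=float)
s = pd.Series(arr)                          # no copy by default
print(np.shares_memory(s.values, arr))      # True: buffer reused
codes = s.astype('category').cat.codes      # a new codes array is allocated
print(np.shares_memory(codes.values, arr))  # False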

I've added support for sparse Tables as well. Pandas stores sparse data column by column (in a COO-like format), so a new array is created during conversion. In Python Script, if a sparse array is passed in as data and automatic pandas conversion is chosen, a warning is shown that this is inefficient.
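
A sketch of the sparse path using pandas' standard entry point (how this PR wires it up is assumed):

import pandas as pd
import scipy.sparse as sp

X = sp.random(100000, 100, density=0.01, format="csr")
df = pd.DataFrame.sparse.from_spmatrix(X)   # one SparseArray per column; data is copied
print(df.sparse.density)                    # ~0.01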

Sparse Benchmark

cols = 100
rows = 100000
[create_df] with 5 loops, best of 3:
	min 187 msec per loop
	avg 206 msec per loop
[create_orangedf] with 5 loops, best of 3:
	min 434 msec per loop
	avg 439 msec per loop
[create_table] with 5 loops, best of 3:
	min 10.9 msec per loop
	avg 11.1 msec per loop

[multiply_df] with 5 loops, best of 3:
	min 28.6 msec per loop
	avg 29.2 msec per loop
[multiply_df_numpy] with 5 loops, best of 3:
	min 59.9 msec per loop
	avg 60.7 msec per loop
[multiply_orangedf] with 5 loops, best of 3:
	min 27.3 msec per loop
	avg 27.7 msec per loop
[multiply_orangedf_numpy] with 5 loops, best of 3:
	min 58.9 msec per loop
	avg 60.6 msec per loop
[multiply_table] with 5 loops, best of 3:
	min 12.8 msec per loop
	avg 13.9 msec per loop

[normalize_df] with 5 loops, best of 3:
	min 246 msec per loop
	avg 251 msec per loop
[normalize_df_numpy] with 5 loops, best of 3:
	min 141 msec per loop
	avg 145 msec per loop
[normalize_orangedf] with 5 loops, best of 3:
	min 239 msec per loop
	avg 244 msec per loop
[normalize_orangedf_numpy] with 5 loops, best of 3:
	min 146 msec per loop
	avg 148 msec per loop
[normalize_table] with 5 loops, best of 3:
	min 601 msec per loop
	avg 617 msec per loop

[revert_orangedf] with 5 loops, best of 3:
	min 343 msec per loop
	avg 351 msec per loop

(Resolved review thread on Orange/data/table.py, now outdated.)
@markotoplak (Member) commented:

For now, I only skimmed the code and wrote some petty comments. I still haven't seriously tried it out or thought about the interface...

@irgolic irgolic removed the "merge after release" label ("Potentially unstable and needs to be tested well.") on Feb 26, 2021
@irgolic irgolic added this to the 3.28.0 milestone Feb 26, 2021
@markotoplak (Member) commented:

When I was looking into this today, I had problems with the setters: (1) they change the table in place, and (2) they are not atomic (separately changing the domain and then the array could be problematic for multithreading). Then, in discussions with @VesnaT, we saw that we already have the exact same problems with Table's value arrays (see #5303).

So now I think that the proposed interface is fine (#5303 needs to be handled separately). @lanzagar, what do you think of the interface added to the Table?

@lanzagar lanzagar merged commit 5135e7f into biolab:master Mar 5, 2021