Better pandas integration #5189

Merged: 3 commits merged into biolab:master from pandas-support on Mar 5, 2021

Conversation

@irgolic (Member) commented on Jan 17, 2021

Based on #5190.

Issue

At the moment, our pandas solution copies the X, Y, and metas numpy arrays into one contiguous array and sets its column dtypes. This is inefficient, and it loses information such as global row IDs, column roles, and variable info.

The proposed solution is not an attempt at migration, but rather a seamless integration: both Orange and pandas operate on numpy arrays, so wrapping beats copying (see the sketch below). This will be used in Python Script, but could also be leveraged by widget code.
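
A minimal illustration of the wrapping-vs-copying point, assuming default pandas settings (constructing a DataFrame from a single homogeneous 2-D array does not copy):

import numpy as np
import pandas as pd

X = np.random.rand(100000, 100)
df = pd.DataFrame(X)                    # wraps the array in one block, no copy
print(np.shares_memory(df.values, X))   # True: both point at the same buffer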

Description of changes

Added a new table_to_frames method, which wraps the three existing numpy arrays into three DataFrames. This is near-instantaneous: no copying occurs, it simply wraps the preexisting data structures. Also:

  • it keeps global row index information by setting pandas indices in the _o1, _o2, _o3 format (see the sketch after this list)
  • upon conversion back to a Table, it copies the columns' original variables (setting compute_value to None)
  • it remembers which role the converted table represented (feature/target/meta), and applies that role to all its columns upon conversion back to a Table
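
A small sketch of the index scheme (illustrative only; the row ids here are hypothetical):

import numpy as np
import pandas as pd

row_ids = [1, 2, 3]                      # hypothetical global row ids from a Table
df = pd.DataFrame(np.random.rand(3, 2), columns=["a", "b"],
                  index=[f"_o{i}" for i in row_ids])
print(list(df.index))                    # ['_o1', '_o2', '_o3']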

An edge case: how should concatenation be handled when one of the frames has weights and the other doesn't?

Includes
  • Code changes
  • Tests
  • Documentation

@irgolic irgolic marked this pull request as draft on January 17, 2021 15:23
@irgolic irgolic force-pushed the pandas-support branch 3 times, most recently from 095ab5b to f6b0937 on January 17, 2021 20:11
@irgolic (Member, Author) commented on Jan 17, 2021

Table stores X, Y, and metas in three separate numpy arrays. DataFrame depends on numpy, just like us, so it's possible to simply wrap each of these arrays in a DataFrame.

Orange provides table_from_frame and table_to_frame functions. When generating a DataFrame from a Table, all three numpy arrays (X, Y, metas) are copied to a contiguous location, and the DataFrame's dtypes are set. When converting back, variable types are re-inferred and the global row index is lost. This creates a fully-fledged standalone DataFrame with no Orange footprint.
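
Typical usage of these existing functions looks roughly like this (a sketch; the module path Orange.data.pandas_compat is assumed):

from Orange.data import Table
from Orange.data.pandas_compat import table_from_frame, table_to_frame  # assumed location

table = Table("iris")
df = table_to_frame(table)      # full copy; standalone DataFrame with dtypes set
table2 = table_from_frame(df)   # variable types re-inferred, global row index lost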

At some point, we contemplated taking Table and stuffing it into the DataFrame interface. This would be nice, but a rewrite would probably be better. Maybe in Orange4, when we separate server and client.

This PR adds a table_to_frames function, which returns three DataFrames: X, Y, and metas. The DataFrames' indices are set in the _o1, _o2, _o3 format. Each frame's metadata keeps the original Orange variables (associated by column name) and the frame's role (attribute/target/meta). When converting back to an Orange Table, global row indices are kept, and the domain is constructed by copying the original columns' variables with compute_value=None (variable types are guessed for new columns).
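
A sketch of the forward conversion (names taken from the PR description; the exact API may differ):

from Orange.data import Table

table = Table("iris")
# method name from the PR description; exact signature assumed
df_x, df_y, df_m = table.table_to_frames()
print(df_x.index[:3])   # _o-style labels carrying the global row ids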

I've added some benchmarks too, wherein table is an Orange Table, df is a natural DataFrame, and orangedf is a DataFrame converted from an Orange Table. Normalization is tested with both pandas vectorization and numpy vectorization (https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6). They're all one-liners; check the code if benchmark names are unclear. The two normalize variants are sketched below.
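
For reference, the normalize one-liners as reconstructed from the description (assumed to match the benchmark code):

normalized = (df - df.mean()) / df.std()              # pandas vectorization
v = df.values
normalized_np = (v - v.mean(axis=0)) / v.std(axis=0)  # numpy vectorization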

Dense Benchmark

cols = 100
rows = 100000
[create_df] with 5 loops, best of 3:
	min 96.3 usec per loop
	avg 109 usec per loop
[create_orangedf] with 5 loops, best of 3:
	min 238 msec per loop
	avg 239 msec per loop
[create_table] with 5 loops, best of 3:
	min 37.7 msec per loop
	avg 38 msec per loop

[multiply_df] with 5 loops, best of 3:
	min 14.6 msec per loop
	avg 14.9 msec per loop
[multiply_df_numpy] with 5 loops, best of 3:
	min 20.5 msec per loop
	avg 21.3 msec per loop
[multiply_orangedf] with 5 loops, best of 3:
	min 14.7 msec per loop
	avg 14.9 msec per loop
[multiply_orangedf_numpy] with 5 loops, best of 3:
	min 20.6 msec per loop
	avg 21.2 msec per loop
[multiply_table] with 5 loops, best of 3:
	min 20.2 msec per loop
	avg 20.5 msec per loop

[normalize_df] with 5 loops, best of 3:
	min 360 msec per loop
	avg 362 msec per loop
[normalize_df_numpy] with 5 loops, best of 3:
	min 113 msec per loop
	avg 114 msec per loop
[normalize_orangedf] with 5 loops, best of 3:
	min 368 msec per loop
	avg 376 msec per loop
[normalize_orangedf_numpy] with 5 loops, best of 3:
	min 98.7 msec per loop
	avg 99.5 msec per loop
[normalize_table] with 5 loops, best of 3:
	min 511 msec per loop
	avg 516 msec per loop

[revert_orangedf] with 5 loops, best of 3:
	min 109 msec per loop
	avg 109 msec per loop

cols = 10000
rows = 100
[create_df] with 5 loops, best of 3:
	min 132 usec per loop
	avg 150 usec per loop
[create_orangedf] with 5 loops, best of 3:
	min 2.72 msec per loop
	avg 2.87 msec per loop
[create_table] with 5 loops, best of 3:
	min 1.23 msec per loop
	avg 1.35 msec per loop

[multiply_df] with 5 loops, best of 3:
	min 775 usec per loop
	avg 792 usec per loop
[multiply_df_numpy] with 5 loops, best of 3:
	min 741 usec per loop
	avg 763 usec per loop
[multiply_orangedf] with 5 loops, best of 3:
	min 785 usec per loop
	avg 798 usec per loop
[multiply_orangedf_numpy] with 5 loops, best of 3:
	min 849 usec per loop
	avg 866 usec per loop
[multiply_table] with 5 loops, best of 3:
	min 740 usec per loop
	avg 781 usec per loop

[normalize_df] with 5 loops, best of 3:
	min 5.25 sec per loop
	avg 5.29 sec per loop
[normalize_df_numpy] with 5 loops, best of 3:
	min 8.02 msec per loop
	avg 8.28 msec per loop
[normalize_orangedf] with 5 loops, best of 3:
	min 5.24 sec per loop
	avg 5.27 sec per loop
[normalize_orangedf_numpy] with 5 loops, best of 3:
	min 6.55 msec per loop
	avg 6.63 msec per loop
[normalize_table] with 5 loops, best of 3:
	min 2.26 sec per loop
	avg 2.27 sec per loop

[revert_orangedf] with 5 loops, best of 3:
	min 114 msec per loop
	avg 116 msec per loop

@codecov (bot) commented on Jan 17, 2021

Codecov Report

Merging #5189 (b9f53ba) into master (31ba74a) will decrease coverage by 0.02%.
The diff coverage is 87.50%.

@@            Coverage Diff             @@
##           master    #5189      +/-   ##
==========================================
- Coverage   85.36%   85.34%   -0.03%     
==========================================
  Files         301      301              
  Lines       61833    62001     +168     
==========================================
+ Hits        52784    52914     +130     
- Misses       9049     9087      +38     

@irgolic irgolic marked this pull request as ready for review on January 17, 2021 21:53
@irgolic irgolic force-pushed the pandas-support branch 3 times, most recently from 91caf1d to e0ce7f2 on January 18, 2021 21:26
@irgolic (Member, Author) commented on Jan 18, 2021

The above benchmark is not incorrect; I've rerun it to confirm. On arrays with a lot of rows, we perform roughly on par with the DataFrame (though about 5× slower on normalize). On arrays with a lot of columns, we're 1000× slower. It seems that we optimize operations for rows, while pandas manages to optimize for both rows and columns. (It could also just be the way normalize is implemented in Orange; please suggest more benchmarks.)

If pandas DataFrames were completely interchangeable with Orange Tables on the same piece of memory, you could instantiate a DataFrame from a Table, run the math, and turn it back into a Table. We're a couple of steps away from that, though. compute_value is set to None during DataFrame -> Table conversion (is this a problem?). More importantly, new df columns (those that did not come from a Table -> DataFrame conversion) must sometimes be preprocessed, which creates a whole new numpy array.
I assume they must be preprocessed because of the way table_from_frame was already implemented:

# Excerpt from table_from_frame (imports added here for context; _is_discrete
# and _is_datetime are module-internal helpers, and attrs, metas, X, M are
# accumulator lists built up earlier in the function).
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
from Orange.data import (ContinuousVariable, DiscreteVariable,
                         StringVariable, TimeVariable)

for name, s in df.items():
    if _is_discrete(s):
        # categorical column: category codes are materialized into a new array
        discrete = s.astype('category').cat
        attrs.append(DiscreteVariable(name, discrete.categories.astype(str).tolist()))
        X.append(discrete.codes.replace(-1, np.nan).values)
    elif _is_datetime(s):
        # datetime column: values are parsed one by one into a new array
        tvar = TimeVariable(name)
        attrs.append(tvar)
        s = pd.to_datetime(s, infer_datetime_format=True)
        X.append(s.astype('str').replace('NaT', np.nan).map(tvar.parse).values)
    elif is_numeric_dtype(s):
        # numeric column: the original buffer is reused as-is
        attrs.append(ContinuousVariable(name))
        X.append(s.values)
    else:
        # everything else becomes a string meta column
        metas.append(StringVariable(name))
        M.append(s.values.astype(object))

The above code creates a new numpy array for each discrete/datetime column, but otherwise reuses the original array. If we were to adjust Table/Variable to be fully compatible with a DataFrame's underlying numpy array, the two could truly be used interchangeably, with an overhead of practically constant complexity.
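
The buffer-reuse claim for numeric columns can be checked directly (a minimal sketch, assuming default pandas copy behaviour):

import numpy as np
import pandas as pd

arr = np.arange(5, dtype=float)
s = pd.Series(arr)                          # no copy by default
print(np.shares_memory(s.values, arr))      # True: buffer reused
codes = s.astype('category').cat.codes      # a new codes array is allocated
print(np.shares_memory(codes.values, arr))  # False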

I've added support for sparse Tables as well. Pandas stores sparse data column by column (in a COO-like format), so a new array is created during conversion. In Python Script, if a sparse array is passed in as data and automatic pandas conversion is chosen, a warning is shown that this is inefficient.
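
A sketch of the sparse path using pandas' standard entry point (how this PR wires it up is assumed):

import pandas as pd
import scipy.sparse as sp

X = sp.random(100000, 100, density=0.01, format="csr")
df = pd.DataFrame.sparse.from_spmatrix(X)   # one SparseArray per column; data is copied
print(df.sparse.density)                    # ~0.01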

Sparse Benchmark

cols = 100
rows = 100000
[create_df] with 5 loops, best of 3:
	min 187 msec per loop
	avg 206 msec per loop
[create_orangedf] with 5 loops, best of 3:
	min 434 msec per loop
	avg 439 msec per loop
[create_table] with 5 loops, best of 3:
	min 10.9 msec per loop
	avg 11.1 msec per loop

[multiply_df] with 5 loops, best of 3:
	min 28.6 msec per loop
	avg 29.2 msec per loop
[multiply_df_numpy] with 5 loops, best of 3:
	min 59.9 msec per loop
	avg 60.7 msec per loop
[multiply_orangedf] with 5 loops, best of 3:
	min 27.3 msec per loop
	avg 27.7 msec per loop
[multiply_orangedf_numpy] with 5 loops, best of 3:
	min 58.9 msec per loop
	avg 60.6 msec per loop
[multiply_table] with 5 loops, best of 3:
	min 12.8 msec per loop
	avg 13.9 msec per loop

[normalize_df] with 5 loops, best of 3:
	min 246 msec per loop
	avg 251 msec per loop
[normalize_df_numpy] with 5 loops, best of 3:
	min 141 msec per loop
	avg 145 msec per loop
[normalize_orangedf] with 5 loops, best of 3:
	min 239 msec per loop
	avg 244 msec per loop
[normalize_orangedf_numpy] with 5 loops, best of 3:
	min 146 msec per loop
	avg 148 msec per loop
[normalize_table] with 5 loops, best of 3:
	min 601 msec per loop
	avg 617 msec per loop

[revert_orangedf] with 5 loops, best of 3:
	min 343 msec per loop
	avg 351 msec per loop

(Resolved review thread on Orange/data/table.py, now outdated.)
@markotoplak (Member) commented:

For now, I only skimmed the code and wrote some petty comments. I still haven't seriously tried it out or thought about the interface...

@irgolic irgolic removed the "merge after release" label ("Potentially unstable and needs to be tested well.") on Feb 26, 2021
@irgolic irgolic added this to the 3.28.0 milestone Feb 26, 2021
@markotoplak (Member) commented:

When I was looking into this today, I had problems with the setters: (1) they change the table in place, and (2) they are not atomic (separately changing the domain and then the array could be problematic for multithreading). Then, in discussions with @VesnaT, we saw that we already have the exact same problems with Table's value arrays (see #5303).

So now I think that the proposed interface is fine (#5303 needs to be handled separately). @lanzagar, what do you think of the interface added to the Table?

@lanzagar lanzagar merged commit 5135e7f into biolab:master Mar 5, 2021