Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Improve conditional_join #1224

Merged
merged 47 commits into from Mar 2, 2023
Merged

[ENH] Improve conditional_join #1224

merged 47 commits into from Mar 2, 2023

Conversation

samukweku
Copy link
Collaborator

@samukweku samukweku commented Dec 17, 2022

PR Description

Please describe the changes proposed in the pull request:

  • Improve selection in conditional_join to allow user return either only df or right
  • indicator parameter added - similar to the indicator parameter in pd.merge
  • rewrite numba code, such that numba is called only once - numpy used as much as possible, where possible
  • improve code for _range_indices algorithm, taking advantage of binary search for improved performance, for scenarios where both columns on the right dataframe are monotonically increasing

This PR resolves #1223 .

Speed test when both columns on the right are monotonically increasing

In [3]: import pandas as pd
   ...: import janitor
   ...: import numpy as np

# adapted from https://stackoverflow.com/a/41094529/7175713
In [4]: # Sample df1.
   ...: n1 = 50_000
   ...: df1 = pd.DataFrame({'date': pd.date_range('2016-11-24', freq='10T', periods=n1), 'value': np.random.random(n1)})
   ...: 
   ...: # Sample df2.
   ...: n2 = 4_00
   ...: df2 = pd.DataFrame({'start_date': pd.date_range('2016-11-24', freq='18H22T', periods=n2)})
   ...: 
   ...: # Randomly shift the start and end dates of the df2 intervals.
   ...: shift_start = pd.Series(np.random.randint(30, size=n2)).cumsum().apply(lambda s: pd.DateOffset(seconds=s))
   ...: shift_end1 = pd.Series(np.random.randint(30, size=n2)).apply(lambda s: pd.DateOffset(seconds=s))
   ...: shift_end2 = pd.Series(np.random.randint(5, 45, size=n2)).apply(lambda m: pd.DateOffset(minutes=m))
   ...: df2['start_date'] += shift_start
   ...: df2['end_date'] = df2['start_date'] + shift_end1 + shift_end2

# this PR 
# _range_indices performs well here - the numpy option sorts only once (if at all)
# and such does less work than the numba option - sorting is relatively expensive
# so if it can be avoided great
# numba shines for overlaps or scenarios where the columns are not both monotically increasing
# and for relatively large dataframes
In [5]: %timeit df1.assign(end = df1.date + pd.DateOffset(minutes=9, seconds=59)).conditional_join(df2, ('date', 'end_date', '<='),('end', 'start_date','>='), use_numb a = False)
5.8 ms ± 764 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit df1.assign(end = df1.date + pd.DateOffset(minutes=9, seconds=59)).conditional_join(df2, ('date', 'end_date', '<='),('end', 'start_date','>='), use_numba = True)
10.6 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Dev
In [7]: %timeit df1.assign(end = df1.date + pd.DateOffset(minutes=9, seconds=59)).conditional_join(df2, ('date', 'end_date', '<='), ('end', 'start_date','>='), use_numba = False)
54 ms ± 8.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit df1.assign(end = df1.date + pd.DateOffset(minutes=9, seconds=59)).conditional_join(df2, ('date', 'end_date', '<='), ('end', 'start_date','>='), use_numba = True)
12.4 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

indicator parameter usage:

In [5]: df1 = pd.DataFrame({'id': [1,1,1,2,2,3],
                        'value_1': [2,5,7,1,3,4]})
   
   df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3],
                        'value_2A': [0,3,7,12,0,2,3,1],
                       'value_2B': [1,5,9,15,1,4,6,3]})

In [6]: (df1
   .conditional_join(
        df2,
        ('id', 'id', '=='),
        ('value_1', 'value_2A', '>='),
       ('value_1', 'value_2B', '<='),
        how='left',
       sort_by_appearance=True,
        indicator=True,
       # this ensures only the `df` dataframe is returned
        right_columns=None
    ))
Out[6]: 
   id  value_1     _merge
0   1        2  left_only
1   1        5       both
2   1        7       both
3   2        1       both
4   2        3       both
5   2        3       both
6   3        4  left_only

PR Checklist

Please ensure that you have done the following:

  1. PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
  1. If you're not on the contributors list, add yourself to AUTHORS.md.
  1. Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
    • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

  • Building a preview of the docs on Netlify
  • Automatically linting the code
  • Making sure the code is documented
  • Making sure that all tests are passed
  • Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

@ericmjl
Copy link
Member

ericmjl commented Dec 17, 2022

@codecov
Copy link

codecov bot commented Dec 17, 2022

Codecov Report

Merging #1224 (d52ccde) into dev (7ad98e3) will increase coverage by 15.13%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##              dev    #1224       +/-   ##
===========================================
+ Coverage   82.51%   97.64%   +15.13%     
===========================================
  Files          78       78               
  Lines        3770     3617      -153     
===========================================
+ Hits         3111     3532      +421     
+ Misses        659       85      -574     

@samukweku samukweku closed this Dec 19, 2022
@samukweku samukweku deleted the samukweku/cond_join_fix branch December 19, 2022 22:07
@ericmjl
Copy link
Member

ericmjl commented Dec 19, 2022

Hey @samukweku, it looked like the PR was heading in a good direction. Any reason for closing?

@samukweku samukweku restored the samukweku/cond_join_fix branch December 20, 2022 00:15
@samukweku
Copy link
Collaborator Author

I just did a minor speed test and noticed a regression. Need to spend some more time on the logic. I'll open it again if I can improve on the logic. If not I'll keep the existing logic

@ericmjl
Copy link
Member

ericmjl commented Dec 20, 2022

I love the rigor here, @samukweku! ❤️

…or scenarios where the range columns are both monotonically increasing when sorted
@ericmjl
Copy link
Member

ericmjl commented Dec 28, 2022

@samukweku when you're ready for review, do let us know!

Maybe we should also try using draft vs. actual PRs in the future?

@samukweku
Copy link
Collaborator Author

@ericmjl my bad... It is ready for review. Thanks for the draft suggestion... I will def use that for future PRs

@ericmjl
Copy link
Member

ericmjl commented Jan 28, 2023

@samukweku in the interest of keeping things moving, I will merge this when all of the CI checks are done.

@ericmjl
Copy link
Member

ericmjl commented Feb 14, 2023

@samukweku I think this PR can be merged once the pre-commit hooks are satisfied!

@samukweku
Copy link
Collaborator Author

@ericmjl I think there is an issue with the pre commit ci - locally everything checks fine (can't fit push without satisfying d pre commit hooks)

@samukweku
Copy link
Collaborator Author

@ericmjl the precommit hooks times out - no idea why :

image

@ericmjl
Copy link
Member

ericmjl commented Mar 2, 2023

No worries, @samukweku. I realize that this PR is quite old now. I'm going to clean up the PRs here by reviewing or merging them, starting with this one!

@ericmjl ericmjl merged commit 468aeb4 into dev Mar 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ENH] Improve selection/perf in conditional_join
2 participants