[ENH] Improve `conditional_join` #1224

samukweku · 2022-12-17T09:30:11Z

PR Description

Please describe the changes proposed in the pull request:

Improve selection in conditional_join to allow user return either only df or right
indicator parameter added - similar to the indicator parameter in pd.merge
rewrite numba code, such that numba is called only once - numpy used as much as possible, where possible
improve code for _range_indices algorithm, taking advantage of binary search for improved performance, for scenarios where both columns on the right dataframe are monotonically increasing

This PR resolves #1223 .

Speed test when both columns on the right are monotonically increasing

In [3]: import pandas as pd
   ...: import janitor
   ...: import numpy as np

# adapted from https://stackoverflow.com/a/41094529/7175713
In [4]: # Sample df1.
   ...: n1 = 50_000
   ...: df1 = pd.DataFrame({'date': pd.date_range('2016-11-24', freq='10T', periods=n1), 'value': np.random.random(n1)})
   ...: 
   ...: # Sample df2.
   ...: n2 = 4_00
   ...: df2 = pd.DataFrame({'start_date': pd.date_range('2016-11-24', freq='18H22T', periods=n2)})
   ...: 
   ...: # Randomly shift the start and end dates of the df2 intervals.
   ...: shift_start = pd.Series(np.random.randint(30, size=n2)).cumsum().apply(lambda s: pd.DateOffset(seconds=s))
   ...: shift_end1 = pd.Series(np.random.randint(30, size=n2)).apply(lambda s: pd.DateOffset(seconds=s))
   ...: shift_end2 = pd.Series(np.random.randint(5, 45, size=n2)).apply(lambda m: pd.DateOffset(minutes=m))
   ...: df2['start_date'] += shift_start
   ...: df2['end_date'] = df2['start_date'] + shift_end1 + shift_end2

# this PR 
# _range_indices performs well here - the numpy option sorts only once (if at all)
# and such does less work than the numba option - sorting is relatively expensive
# so if it can be avoided great
# numba shines for overlaps or scenarios where the columns are not both monotically increasing
# and for relatively large dataframes
In [5]: %timeit df1.assign(end = df1.date + pd.DateOffset(minutes=9, seconds=59)).conditional_join(df2, ('date', 'end_date', '<='),('end', 'start_date','>='), use_numb a = False)
5.8 ms ± 764 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit df1.assign(end = df1.date + pd.DateOffset(minutes=9, seconds=59)).conditional_join(df2, ('date', 'end_date', '<='),('end', 'start_date','>='), use_numba = True)
10.6 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Dev
In [7]: %timeit df1.assign(end = df1.date + pd.DateOffset(minutes=9, seconds=59)).conditional_join(df2, ('date', 'end_date', '<='), ('end', 'start_date','>='), use_numba = False)
54 ms ± 8.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit df1.assign(end = df1.date + pd.DateOffset(minutes=9, seconds=59)).conditional_join(df2, ('date', 'end_date', '<='), ('end', 'start_date','>='), use_numba = True)
12.4 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

indicator parameter usage:

In [5]: df1 = pd.DataFrame({'id': [1,1,1,2,2,3],
                        'value_1': [2,5,7,1,3,4]})
   
   df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3],
                        'value_2A': [0,3,7,12,0,2,3,1],
                       'value_2B': [1,5,9,15,1,4,6,3]})

In [6]: (df1
   .conditional_join(
        df2,
        ('id', 'id', '=='),
        ('value_1', 'value_2A', '>='),
       ('value_1', 'value_2B', '<='),
        how='left',
       sort_by_appearance=True,
        indicator=True,
       # this ensures only the `df` dataframe is returned
        right_columns=None
    ))
Out[6]: 
   id  value_1     _merge
0   1        2  left_only
1   1        5       both
2   1        7       both
3   2        1       both
4   2        3       both
5   2        3       both
6   3        4  left_only

PR Checklist

Please ensure that you have done the following:

PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.

If you're not on the contributors list, add yourself to AUTHORS.md.

Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
- Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

Building a preview of the docs on Netlify
Automatically linting the code
Making sure the code is documented
Making sure that all tests are passed
Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

@ericmjl

…oin check

ericmjl · 2022-12-17T09:35:01Z

🚀 Deployed on https://deploy-preview-1224--pyjanitor.netlify.app

codecov · 2022-12-17T09:41:05Z

Codecov Report

Merging #1224 (d52ccde) into dev (7ad98e3) will increase coverage by 15.13%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##              dev    #1224       +/-   ##
===========================================
+ Coverage   82.51%   97.64%   +15.13%     
===========================================
  Files          78       78               
  Lines        3770     3617      -153     
===========================================
+ Hits         3111     3532      +421     
+ Misses        659       85      -574

ericmjl · 2022-12-19T22:54:16Z

Hey @samukweku, it looked like the PR was heading in a good direction. Any reason for closing?

samukweku · 2022-12-20T00:16:16Z

I just did a minor speed test and noticed a regression. Need to spend some more time on the logic. I'll open it again if I can improve on the logic. If not I'll keep the existing logic

ericmjl · 2022-12-20T00:21:07Z

I love the rigor here, @samukweku! ❤️

…or scenarios where the range columns are both monotonically increasing when sorted

ericmjl · 2022-12-28T19:11:25Z

@samukweku when you're ready for review, do let us know!

Maybe we should also try using draft vs. actual PRs in the future?

samukweku · 2022-12-29T07:18:29Z

@ericmjl my bad... It is ready for review. Thanks for the draft suggestion... I will def use that for future PRs

ericmjl · 2023-01-28T21:21:46Z

@samukweku in the interest of keeping things moving, I will merge this when all of the CI checks are done.

ericmjl · 2023-02-14T13:51:34Z

@samukweku I think this PR can be merged once the pre-commit hooks are satisfied!

samukweku · 2023-02-14T19:38:37Z

@ericmjl I think there is an issue with the pre commit ci - locally everything checks fine (can't fit push without satisfying d pre commit hooks)

samukweku · 2023-02-15T23:04:18Z

@ericmjl the precommit hooks times out - no idea why :

ericmjl · 2023-03-02T18:36:37Z

No worries, @samukweku. I realize that this PR is quite old now. I'm going to clean up the PRs here by reviewing or merging them, starting with this one!

¨samukweku¨ and others added 15 commits November 19, 2022 15:00

úpdates´

d685085

updates

b4b42f1

update comments

e6d284c

spelling fix

3664aa5

minor clean

cf6046f

better logic for non equi join check

518ac9e

better logic for non equi join check

5d1a9ef

better logic for non equi join check

8f5ffd5

restore strictness on column dtypes and simplify logic for non equi j…

7eae484

…oin check

fix logic

bafdeee

allow selection of either only df or right

ecb968b

add indicator parameter

a157f39

add tests for indicator parameter

bc7e72c

add version added to docs

0f52a48

changelog

2c72d9a

samukweku requested review from Zeroto521, thatlittleboy and ericmjl December 17, 2022 09:30

samukweku self-assigned this Dec 17, 2022

samukweku added 2 commits December 18, 2022 13:37

improve perf for empty dataframe

2220d72

simplify logic

b809f52

samukweku closed this Dec 19, 2022

samukweku deleted the samukweku/cond_join_fix branch December 19, 2022 22:07

samukweku restored the samukweku/cond_join_fix branch December 20, 2022 00:15

performance improvement for range joins if use_numba=False - occurs f…

06a2af7

…or scenarios where the range columns are both monotonically increasing when sorted

ericmjl and others added 2 commits December 28, 2022 14:12

Merge dev into samukweku/cond_join_fix

366cb8f

use np.delete instead of np.bincount

4aa0cd6

samukweku and others added 12 commits January 14, 2023 23:27

remove dead code

d246fd3

fix type hint

8a95714

Merge branch 'dev' into samukweku/cond_join_fix

85f325e

rebase

6d8e995

rebase

4f7b69e

rebase

05edb23

rebase

0af8d69

rebase

00fde1a

cleanup

fc59d26

fix conditional_join.ipynb

aa5332a

Merge branch 'dev' into samukweku/cond_join_fix

b8a06bb

Merge dev into samukweku/cond_join_fix

4f1598a

ericmjl approved these changes Jan 28, 2023

View reviewed changes

ericmjl and others added 5 commits January 28, 2023 22:48

Merge dev into samukweku/cond_join_fix

fe8f206

Merge dev into samukweku/cond_join_fix

7a300b0

Merge branch 'dev' into samukweku/cond_join_fix

08a6c7b

edit docs to match updated google style

ba06730

Merge dev into samukweku/cond_join_fix

84b1c99

update

991a739

Merge dev into samukweku/cond_join_fix

d52ccde

ericmjl merged commit 468aeb4 into dev Mar 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH] Improve `conditional_join` #1224

[ENH] Improve `conditional_join` #1224

samukweku commented Dec 17, 2022 •

edited

ericmjl commented Dec 17, 2022 •

edited

codecov bot commented Dec 17, 2022 •

edited

ericmjl commented Dec 19, 2022

samukweku commented Dec 20, 2022

ericmjl commented Dec 20, 2022

ericmjl commented Dec 28, 2022

samukweku commented Dec 29, 2022

ericmjl commented Jan 28, 2023

ericmjl commented Feb 14, 2023

samukweku commented Feb 14, 2023

samukweku commented Feb 15, 2023

ericmjl commented Mar 2, 2023

[ENH] Improve conditional_join #1224

[ENH] Improve conditional_join #1224

Conversation

samukweku commented Dec 17, 2022 • edited

PR Description

PR Checklist

Automatic checks

Relevant Reviewers

ericmjl commented Dec 17, 2022 • edited

codecov bot commented Dec 17, 2022 • edited

Codecov Report

ericmjl commented Dec 19, 2022

samukweku commented Dec 20, 2022

ericmjl commented Dec 20, 2022

ericmjl commented Dec 28, 2022

samukweku commented Dec 29, 2022

ericmjl commented Jan 28, 2023

ericmjl commented Feb 14, 2023

samukweku commented Feb 14, 2023

samukweku commented Feb 15, 2023

ericmjl commented Mar 2, 2023

[ENH] Improve `conditional_join` #1224

[ENH] Improve `conditional_join` #1224

samukweku commented Dec 17, 2022 •

edited

ericmjl commented Dec 17, 2022 •

edited

codecov bot commented Dec 17, 2022 •

edited