Report merge stats in joins (enhancement) #25

Closed
lpuettmann opened this issue May 19, 2019 · 4 comments

@lpuettmann lpuettmann commented May 19, 2019

Hey Benjamin,

Thanks for the cool package!

Wouldn't it be useful to report what share of rows have been dropped?

So for this:

x <- inner_join(band_members, band_instruments, by = "name")

show:
33% of left dataframe and 33% of right dataframe's rows dropped.

I find that these joins are a common source of bugs in my programs, and it usually comes down to losing many rows from either the left or the right dataframe.

Best,
Lukas

Owner

@elbersb elbersb commented May 20, 2019

Hey,
right now, tidylog already reports this:

x <- inner_join(band_members, band_instruments, by = "name")
#> inner_join: removed one row and added one column (plays)

It's tricky to figure out how many rows are dropped from the "right" data frame, because then tidylog needs to figure out which rows have been merged, no? Of course it would be possible to add the percentage share of rows dropped to the left data frame.
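
For illustration, here is a minimal sketch of how the dropped share on both sides could be computed outside of tidylog. It assumes plain dplyr and uses `anti_join()` in both directions; it is not how tidylog works internally, and the variable names are made up for this example.

```r
library(dplyr)

# Rows of each input that find no partner in the other input
# are exactly the rows an inner join drops.
dropped_left  <- anti_join(band_members, band_instruments, by = "name")
dropped_right <- anti_join(band_instruments, band_members, by = "name")

share_left  <- nrow(dropped_left)  / nrow(band_members)
share_right <- nrow(dropped_right) / nrow(band_instruments)

message(sprintf(
  "%.0f%% of left dataframe and %.0f%% of right dataframe's rows dropped.",
  100 * share_left, 100 * share_right
))
```

The cost is two extra joins per call, which may matter for large inputs.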

@rubenarslan rubenarslan commented Jul 19, 2019

I'd be interested to see:
rows in x not in y, rows in y not in x, rows in x and y (maybe as percentages of totals or also showing total number of rows), as well as number of duplicates for the by variables on each side.

I know this is much more extensive and verbose than what you have now, but my thinking goes that mistakes happen often enough that further logging would be very useful.

Some code here
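
Separately from the code linked above, a minimal sketch of the statistics listed in this comment might look like the following. `join_overlap()` is a hypothetical helper, not part of dplyr or tidylog, and it assumes the `by` columns have the same names in both data frames.

```r
library(dplyr)

# Hypothetical helper (not part of tidylog or dplyr): summarise how two
# data frames overlap on the `by` columns before joining them.
# Assumes `by` is an unnamed character vector of columns present in both.
join_overlap <- function(x, y, by) {
  c(
    only_in_x  = nrow(anti_join(x, y, by = by)),  # rows in x not in y
    only_in_y  = nrow(anti_join(y, x, by = by)),  # rows in y not in x
    x_matched  = nrow(semi_join(x, y, by = by)),  # rows in x with a match in y
    y_matched  = nrow(semi_join(y, x, by = by)),  # rows in y with a match in x
    x_dup_keys = sum(duplicated(x[by])),          # repeated key combinations in x
    y_dup_keys = sum(duplicated(y[by]))           # repeated key combinations in y
  )
}

overlap <- join_overlap(band_members, band_instruments, by = "name")
print(overlap)
```

Dividing the counts by `nrow(x)` or `nrow(y)` gives the percentages mentioned above.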

Owner

@elbersb elbersb commented Jul 25, 2019

I started working on this on a new branch joins. If you have time, could you check it out?
Install via:

devtools::install_github("elbersb/tidylog@joins")

Now the join commands print something like this:

> tidylog::left_join(flights[1:10000, ], airlines[1:10, ], by = "carrier")
#>left_join: added one column (name)
#>           rows only in x    2,783
#>           rows only in y  (     0)
#>           matched rows      7,217
#>                           ========
#>           rows total       10,000

Any time a number is printed in parentheses, it means that those rows are not included in the result. For instance:

> tidylog::anti_join(flights[1:100, ], airlines[1:10, ], by = "carrier")
#>anti_join: added no new columns
#>           rows only in x   34
#>           rows only in y  ( 3)
#>           matched rows    (66)
#>                           ====
#>           rows total       34
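
The parenthesis convention can be checked by hand on a small example. The sketch below assumes plain dplyr and its bundled `band_members` and `band_instruments` data with unique keys; it only verifies the arithmetic and does not reproduce tidylog's output.

```r
library(dplyr)

x <- band_members
y <- band_instruments

only_x  <- nrow(anti_join(x, y, by = "name"))   # rows only in x
only_y  <- nrow(anti_join(y, x, by = "name"))   # rows only in y
matched <- nrow(inner_join(x, y, by = "name"))  # matched rows

# left_join keeps rows only in x plus matched rows;
# rows only in y are the ones that would be shown in parentheses.
stopifnot(nrow(left_join(x, y, by = "name")) == only_x + matched)

# anti_join keeps just the rows only in x;
# matched rows and rows only in y would both be shown in parentheses.
stopifnot(nrow(anti_join(x, y, by = "name")) == only_x)
```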

This hasn't been tested much yet, so expect some issues...

@elbersb elbersb closed this as completed Aug 7, 2019
Owner

@elbersb elbersb commented Aug 7, 2019

This is now on the main branch.
