Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] right_join() function does not produce the expected outcome #14907

Closed
abduazizR opened this issue Dec 10, 2022 · 8 comments · Fixed by #15077
Closed

[R] right_join() function does not produce the expected outcome #14907

abduazizR opened this issue Dec 10, 2022 · 8 comments · Fixed by #15077
Assignees
Milestone

Comments

@abduazizR
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

Hi,

I noticed something strange today when I was using arrow datasets. I cannot give a reproducible example but you can get the idea from the code below. I have ccaei as an arrow dataset. When I try to use right_join() with an R tibble before using collect(), it gives me wrong numbers (the number of distinct ENROLID is less than that present in outpatients). I get the correct number when I use right_join() after using collect(), although this is computationally inefficient. Could you help me with this?

This gives a really weird number

ccaei |>  
  filter(ADMDATE >= as_date("2016-10-01")) |> 
  filter(!is.na(ENROLID)) |> 
  select(ENROLID, ADMDATE) |> 
  right_join(outpatients) |> 
  collect() |> count(ENROLID)

This makes sense

ccaei |>  
  filter(ADMDATE >= as_date("2016-10-01")) |> 
  filter(!is.na(ENROLID)) |> 
  select(ENROLID, ADMDATE) |> 
  collect() |> 
    right_join(outpatients) |> 
  count(ENROLID)

Not sure where the mistake came from.

Component(s)

R

@thisisnic
Copy link
Member

thisisnic commented Dec 11, 2022

Hi @abduazizR , thanks for reporting this! Just to check, which version of Arrow are you using?

I understand that your data may be confidential, but is it possible to create a reproducible example using in-built datasets instead?

@abduazizR
Copy link
Author

Hi Nic,

Thanks for responding. I am using the latest version of Arrow 10.0.1

In the example I posted, ccaei is a dataset I built using multiple parquet files. I will attach an example, but it does not show exactly the problem I faced.

library(tidyverse)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(duckdb)
#> Loading required package: DBI

data1 <- tibble(
  id = 1:100,
  x = rnorm(100)
)

data2 <- tibble(
  id = 52:69,
  z = rbinom(18, size = 18, prob = 0.5)
)

data2_arrow <- data2 |> as_arrow_table()

# The weired right_join output
data2_arrow |> 
  right_join(data1) |> collect() |> 
  # The ids from data 1 are missing
  tail(10)
#> # A tibble: 10 × 3
#>       id     z      x
#>    <int> <int>  <dbl>
#>  1    NA    NA -0.369
#>  2    NA    NA  0.470
#>  3    NA    NA -0.372
#>  4    NA    NA  0.505
#>  5    NA    NA  1.47 
#>  6    NA    NA -0.858
#>  7    NA    NA -0.330
#>  8    NA    NA  2.02 
#>  9    NA    NA  1.53 
#> 10    NA    NA -0.399

# The expected right_join output 
data2 |> 
  right_join(data1) |> 
  tail(10)
#> Joining, by = "id"
#> # A tibble: 10 × 3
#>       id     z      x
#>    <int> <int>  <dbl>
#>  1    91    NA -0.369
#>  2    92    NA  0.470
#>  3    93    NA -0.372
#>  4    94    NA  0.505
#>  5    95    NA  1.47 
#>  6    96    NA -0.858
#>  7    97    NA -0.330
#>  8    98    NA  2.02 
#>  9    99    NA  1.53 
#> 10   100    NA -0.399

Created on 2022-12-11 with reprex v2.0.2

With my data, right_join is acting as inner_join. As a result, I don't even have rows with NAs in the ID column.

Thank you.

@thisisnic
Copy link
Member

Thanks for the reprex there, this looks similar to an issue we had back in ARROW-15838 though there has been work done to fix it - I'm suspecting we might need to do something extra to make this work for right_join().

I'll take a look into what's going on in the example in the reprex, which hopefully should also fix the original problem you had, though it's not guaranteed - if you do find a reprex that captures that, feel free to add that here.

I can confirm that I can reproduce the error from the reprex in the latest dev version of Arrow.

@abduazizR
Copy link
Author

Thanks for the response. Can I rely on left_join() for now until the issue with right_join() gets resolved?

@thisisnic
Copy link
Member

@abduazizR It looks like that issue is restricted to right_join() only - see the reprex below:

library(arrow)
library(dplyr)

tbl <- tibble::tibble(x = 1:10, y = 1:10)
tbl2 <- tibble::tibble(y = 3:5, z = c("a", "b", "c"))

# right join using dplyr
tbl2 %>%
  right_join(tbl)
#> Joining, by = "y"
#> # A tibble: 10 × 3
#>        y z         x
#>    <int> <chr> <int>
#>  1     3 a         3
#>  2     4 b         4
#>  3     5 c         5
#>  4     1 <NA>      1
#>  5     2 <NA>      2
#>  6     6 <NA>      6
#>  7     7 <NA>      7
#>  8     8 <NA>      8
#>  9     9 <NA>      9
#> 10    10 <NA>     10

# right join using Arrow
tbl2 %>%
  arrow_table() %>%
  right_join(tbl) %>%
  collect()
#> # A tibble: 10 × 3
#>        y z         x
#>    <int> <chr> <int>
#>  1     3 a         3
#>  2     4 b         4
#>  3     5 c         5
#>  4    NA <NA>      1
#>  5    NA <NA>      2
#>  6    NA <NA>      6
#>  7    NA <NA>      7
#>  8    NA <NA>      8
#>  9    NA <NA>      9
#> 10    NA <NA>     10

# left join using dplyr
tbl %>%
  left_join(tbl2)
#> Joining, by = "y"
#> # A tibble: 10 × 3
#>        x     y z    
#>    <int> <int> <chr>
#>  1     1     1 <NA> 
#>  2     2     2 <NA> 
#>  3     3     3 a    
#>  4     4     4 b    
#>  5     5     5 c    
#>  6     6     6 <NA> 
#>  7     7     7 <NA> 
#>  8     8     8 <NA> 
#>  9     9     9 <NA> 
#> 10    10    10 <NA>

# left join using Arrow
tbl %>%
  arrow_table() %>%
  left_join(tbl2) %>%
  collect()
#> # A tibble: 10 × 3
#>        x     y z    
#>    <int> <int> <chr>
#>  1     3     3 a    
#>  2     4     4 b    
#>  3     5     5 c    
#>  4     1     1 <NA> 
#>  5     2     2 <NA> 
#>  6     6     6 <NA> 
#>  7     7     7 <NA> 
#>  8     8     8 <NA> 
#>  9     9     9 <NA> 
#> 10    10    10 <NA>

wjones127 pushed a commit that referenced this issue Jan 4, 2023
…come (#15077)

* Closes: #14907

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
@wjones127 wjones127 added this to the 11.0.0 milestone Jan 4, 2023
EpsilonPrime pushed a commit to EpsilonPrime/arrow that referenced this issue Jan 5, 2023
…ed outcome (apache#15077)

* Closes: apache#14907

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
vibhatha pushed a commit to vibhatha/arrow that referenced this issue Jan 9, 2023
…ed outcome (apache#15077)

* Closes: apache#14907

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
@annakrystalli
Copy link

Hello!

I was wondering if this has been resolved as I'm still coming up against behaviour that differs from dplyr behaviour in both left_join() and right_join() (see reprex below) in arrow version 12.0.1.1.

What appears to me to be the difference is that while dplyr matches NA values by default, arrow does not seem to.

Am I missing sth?

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

tbl1 <- tibble::tibble(
  a = 1:3,
  b = c("a", "b", NA),
  d = c(letters[4:6])
)

tbl2 <- tibble::tibble(
  b = c("b", NA),
  c = c("a should be 2", "a should be 3")
)

# Left join tibbles, NAs matched
left_join(tbl2, tbl1)
#> Joining with `by = join_by(b)`
#> # A tibble: 2 × 4
#>   b     c                 a d    
#>   <chr> <chr>         <int> <chr>
#> 1 b     a should be 2     2 e    
#> 2 <NA>  a should be 3     3 f

# Left join arrow table & tibble, NAs NOT matched
left_join(as_arrow_table(tbl2), tbl1) %>% collect()
#> # A tibble: 2 × 4
#>   b     c                 a d    
#>   <chr> <chr>         <int> <chr>
#> 1 b     a should be 2     2 e    
#> 2 <NA>  a should be 3    NA <NA>
# Left join arrow table & arrow table, NAs NOT matched
left_join(as_arrow_table(tbl2), as_arrow_table(tbl1)) %>% collect()
#> # A tibble: 2 × 4
#>   b     c                 a d    
#>   <chr> <chr>         <int> <chr>
#> 1 b     a should be 2     2 e    
#> 2 <NA>  a should be 3    NA <NA>


# Same with right_join. It appears that the NAs are the problem as arrow doesn't seem to count them as a match
right_join(tbl1, tbl2)
#> Joining with `by = join_by(b)`
#> # A tibble: 2 × 4
#>       a b     d     c            
#>   <int> <chr> <chr> <chr>        
#> 1     2 b     e     a should be 2
#> 2     3 <NA>  f     a should be 3
right_join(as_arrow_table(tbl1), as_arrow_table(tbl2)) %>% collect()
#> # A tibble: 2 × 4
#>       a b     d     c            
#>   <int> <chr> <chr> <chr>        
#> 1     2 b     e     a should be 2
#> 2    NA <NA>  <NA>  a should be 3
right_join(as_arrow_table(tbl1), tbl2) %>% collect()
#> # A tibble: 2 × 4
#>       a b     d     c            
#>   <int> <chr> <chr> <chr>        
#> 1    NA <NA>  <NA>  a should be 3
#> 2     2 b     e     a should be 2

Created on 2023-09-26 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       macOS Ventura 13.5.2
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Athens
#>  date     2023-09-26
#>  pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version  date (UTC) lib source
#>  arrow       * 12.0.1.1 2023-07-18 [1] CRAN (R 4.2.0)
#>  assertthat    0.2.1    2019-03-21 [1] CRAN (R 4.2.0)
#>  bit           4.0.5    2022-11-15 [1] CRAN (R 4.2.0)
#>  bit64         4.0.5    2020-08-30 [1] CRAN (R 4.2.0)
#>  cli           3.6.1    2023-03-23 [1] CRAN (R 4.2.0)
#>  digest        0.6.33   2023-07-07 [1] CRAN (R 4.2.0)
#>  dplyr       * 1.1.3    2023-09-03 [1] CRAN (R 4.2.0)
#>  evaluate      0.20     2023-01-17 [1] CRAN (R 4.2.0)
#>  fansi         1.0.4    2023-01-22 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.1    2023-02-24 [1] CRAN (R 4.2.0)
#>  fs            1.6.3    2023-07-20 [1] CRAN (R 4.2.0)
#>  generics      0.1.3    2022-07-05 [1] CRAN (R 4.2.1)
#>  glue          1.6.2    2022-02-24 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.6    2023-08-10 [1] CRAN (R 4.2.0)
#>  knitr         1.42     2023-01-25 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.3    2022-10-07 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3    2022-03-30 [1] CRAN (R 4.2.0)
#>  pillar        1.9.0    2023-03-22 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3    2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         1.0.2    2023-08-10 [1] CRAN (R 4.2.0)
#>  R.cache       0.16.0   2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3   1.8.2    2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo          1.25.0   2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils       2.12.2   2022-11-11 [1] CRAN (R 4.2.0)
#>  R6            2.5.1    2021-08-19 [1] CRAN (R 4.2.0)
#>  reprex        2.0.2    2022-08-17 [3] CRAN (R 4.2.0)
#>  rlang         1.1.1    2023-04-28 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.21     2023-03-26 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.14     2022-08-22 [1] CRAN (R 4.2.1)
#>  sessioninfo   1.2.2    2021-12-06 [3] CRAN (R 4.2.0)
#>  styler        1.7.0    2022-03-13 [1] CRAN (R 4.2.0)
#>  tibble        3.2.1    2023-03-20 [1] CRAN (R 4.2.0)
#>  tidyselect    1.2.0    2022-10-10 [1] CRAN (R 4.2.0)
#>  utf8          1.2.3    2023-01-31 [1] CRAN (R 4.2.0)
#>  vctrs         0.6.3    2023-06-14 [1] CRAN (R 4.2.0)
#>  withr         2.5.0    2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.39     2023-04-20 [1] CRAN (R 4.2.0)
#>  yaml          2.3.7    2023-01-23 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Users/Anna/Library/R/arm64/4.2/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/site-library
#>  [3] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

@thisisnic
Copy link
Member

Hi @annakrystalli, the problem described in the original issue has been fixed, but the problem you are describing in your reprex there is something else. I've opened up #37902 as a feature request, and have added some more explanation there - happy to discuss further on that ticket

@annakrystalli
Copy link

Great! Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants