feat(r/sedonadb): Add join expression evaluation #781

Merged: paleolimbot merged 18 commits into apache:main from paleolimbot:r-join-eval-again on Apr 29, 2026

@paleolimbot (Member) commented Apr 23, 2026

Adds sd_join() to the R bindings with a friendly way to specify both the join condition and the output selection. Both are painful and verbose to handle by hand: there are many ways to specify join keys and many ways to disambiguate names in the output. I implemented roughly what dplyr does with join_by() and suffix, plus an escape hatch for other kinds of selections one might want.

library(sedonadb)
library(nycflights13)

res <- nycflights13::flights |> 
  sd_select(year, month, day, flight, tailnum) |> 
  sd_join(
    nycflights13::planes |> sd_select(tailnum, type, manufacturer),
    select = sd_join_select_default()
  ) |> 
  sd_group_by(manufacturer) |> 
  sd_summarise(n = n()) |> 
  sd_arrange(desc(n))

res
#> <sedonab_dataframe: NA x 2>
#> ┌───────────────────────────────┬───────┐
#> │          manufacturer         ┆   n   │
#> │              utf8             ┆ int64 │
#> ╞═══════════════════════════════╪═══════╡
#> │ BOEING                        ┆ 82912 │
#> ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
#> │ EMBRAER                       ┆ 66068 │
#> ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
#> │ AIRBUS                        ┆ 47302 │
#> ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
#> │ AIRBUS INDUSTRIE              ┆ 40891 │
#> ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
#> │ BOMBARDIER INC                ┆ 28272 │
#> ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
#> │ MCDONNELL DOUGLAS AIRCRAFT CO ┆  8932 │
#> └───────────────────────────────┴───────┘
#> Preview of up to 6 row(s)

bench::mark(as.data.frame(res))
#> # A tibble: 1 × 6
#>   expression              min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 as.data.frame(res)   2.75ms   3.13ms      307.    31.8KB        0

bench::mark(
res <- nycflights13::flights |> 
  dplyr::select(year, month, day, flight, tailnum) |> 
  dplyr::inner_join(
    nycflights13::planes |> dplyr::select(tailnum, type, manufacturer),
    by = "tailnum"
  ) |> 
  dplyr::count(manufacturer) |> 
  dplyr::arrange(desc(n))
)
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                          <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 res <- dplyr::arrange(dplyr::count… 17.2ms 18.8ms      47.1      44MB     101.

res
#> # A tibble: 35 × 2
#>    manufacturer                      n
#>    <chr>                         <int>
#>  1 BOEING                        82912
#>  2 EMBRAER                       66068
#>  3 AIRBUS                        47302
#>  4 AIRBUS INDUSTRIE              40891
#>  5 BOMBARDIER INC                28272
#>  6 MCDONNELL DOUGLAS AIRCRAFT CO  8932
#>  7 MCDONNELL DOUGLAS              3998
#>  8 CANADAIR                       1594
#>  9 MCDONNELL DOUGLAS CORPORATION  1259
#> 10 CESSNA                          658
#> # ℹ 25 more rows
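
For readers less familiar with the dplyr conventions this PR borrows, here is a minimal sketch of join_by() and suffix in dplyr itself (not sedonadb). The toy data frames and their column names are made up for illustration; only inner_join(), join_by(), and the suffix argument are real dplyr API, and sd_join() may differ in its details.

```r
library(dplyr)

# Hypothetical toy tables: the key has a different name on each side,
# and both tables carry a non-key column called `year`.
flights <- data.frame(
  tailnum = c("N1", "N2", "N3"),
  year = c(2013L, 2013L, 2013L)
)
planes <- data.frame(
  tail = c("N1", "N2"),
  year = c(1999L, 2004L)
)

# join_by() states the key relationship explicitly; suffix renames the
# non-key columns whose names clash between the two inputs.
joined <- inner_join(
  flights,
  planes,
  by = join_by(tailnum == tail),
  suffix = c("_flight", "_plane")
)

names(joined)
nrow(joined)
```

In dplyr the key column is kept once under its left-hand name, and the clashing year columns come out as year_flight and year_plane. That is the kind of output disambiguation the default selection is meant to handle.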

sd_join() also works with st_intersects() and friends for a spatial join:

library(sedonadb)

cities <- sd_read_parquet(
  "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_cities.parquet"
)

countries <- sd_read_parquet(
  "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_countries.parquet"
)

countries |> 
  sd_join(
    cities,
    by = sd_join_by(st_intersects(x$geometry, y$geometry)),
    select = sd_join_select(
      city = y$name,
      country = x$name,
      continent,
      geometry = y$geometry
    )
  )
#> <sedonab_dataframe: NA x 4>
#> ┌──────────────┬────────────────┬───────────┬───────────────────────────────┐
#> │     city     ┆     country    ┆ continent ┆            geometry           │
#> │     utf8     ┆      utf8      ┆    utf8   ┆            geometry           │
#> ╞══════════════╪════════════════╪═══════════╪═══════════════════════════════╡
#> │ Vatican City ┆ Italy          ┆ Europe    ┆ POINT(12.4533865 41.9032822)  │
#> ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
#> │ San Marino   ┆ Italy          ┆ Europe    ┆ POINT(12.4417702 43.9360958)  │
#> ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
#> │ Vaduz        ┆ Austria        ┆ Europe    ┆ POINT(9.5166695 47.1337238)   │
#> ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
#> │ Lobamba      ┆ eSwatini       ┆ Africa    ┆ POINT(31.1999971 -26.4666675) │
#> ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
#> │ Luxembourg   ┆ Luxembourg     ┆ Europe    ┆ POINT(6.1300028 49.6116604)   │
#> ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
#> │ Bir Lehlou   ┆ Western Sahara ┆ Africa    ┆ POINT(-9.6525222 26.1191667)  │
#> └──────────────┴────────────────┴───────────┴───────────────────────────────┘
#> Preview of up to 6 row(s)

Created on 2026-04-24 with reprex v2.1.1
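
Conceptually, a join on a predicate such as st_intersects() keeps every pair of rows for which the predicate is true. The base R sketch below illustrates that idea with a deliberately simplified stand-in: axis-aligned boxes instead of real geometries, and a cross join followed by a filter instead of whatever strategy the engine actually uses. All data and column names are hypothetical.

```r
# Hypothetical "countries" as bounding boxes and "cities" as points.
countries <- data.frame(
  country = c("A", "B"),
  xmin = c(0, 10), xmax = c(10, 20),
  ymin = c(0, 0), ymax = c(10, 10)
)
cities <- data.frame(
  city = c("p", "q", "r"),
  lon = c(2, 15, 30),
  lat = c(3, 5, 5)
)

# merge() with by = NULL produces the Cartesian product of the two inputs.
pairs <- merge(countries, cities, by = NULL)

# Point-in-box test standing in for a real spatial predicate.
hit <- pairs$lon >= pairs$xmin & pairs$lon <= pairs$xmax &
  pairs$lat >= pairs$ymin & pairs$lat <= pairs$ymax

# Keep matching pairs and project the output columns explicitly.
result <- pairs[hit, c("city", "country")]
result
```

The final projection plays the same role as the select argument above: because a predicate join has no shared key column to collapse, the output columns are chosen explicitly.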

@github-actions github-actions Bot requested a review from zhangfengcdt April 23, 2026 22:28
Copilot AI (Contributor) left a comment

Pull request overview

Adds join-expression support to the R sedonadb bindings, enabling sd_join() with dplyr-like join condition specification (sd_join_by()) and post-join column selection/disambiguation (sd_join_select_default(), sd_join_select()), backed by new Rust FFI methods for join execution and expression introspection.

Changes:

  • Introduces sd_join(), sd_join_by(), sd_join_select_default(), and sd_join_select() plus join expression evaluation utilities.
  • Extends Rust/R FFI to support DataFusion join_on() and adds expression inspection helpers (qualified_name(), variant_name(), parse_binary()).
  • Adds comprehensive testthat coverage and snapshots for join-expression behavior and default selection rules.
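
To make the idea of binary-expression introspection concrete, the sketch below takes apart a quoted R expression such as x$tailnum == y$tail into its operator and its two table-qualified column references. This is only an analogy in base R: the helper names parse_binary_sketch() and table_ref() are hypothetical, and the PR's parse_binary() works on engine expressions over FFI rather than on R language objects.

```r
# Split a binary call into operator, left operand, and right operand.
parse_binary_sketch <- function(expr) {
  stopifnot(is.call(expr), length(expr) == 3L)
  list(
    op = as.character(expr[[1]]),
    left = expr[[2]],
    right = expr[[3]]
  )
}

# Turn a call of the form table$column into its two names.
table_ref <- function(side) {
  stopifnot(is.call(side), identical(side[[1]], as.name("$")))
  c(table = as.character(side[[2]]), column = as.character(side[[3]]))
}

parts <- parse_binary_sketch(quote(x$tailnum == y$tail))

parts$op
table_ref(parts$left)
table_ref(parts$right)
```

Knowing which side of an equality refers to which input is what lets a join layer decide, for example, which key column to keep in the output.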

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 7 comments.

Summary per file:

| File | Description |
| --- | --- |
| r/sedonadb/R/join-expression.R | New join condition/select specification and join-expression evaluation + default post-join projection logic. |
| r/sedonadb/R/dataframe.R | Adds sd_join() and extends sd_summarise()/sd_summarize() with .env. |
| r/sedonadb/R/expression.R | Adds sd_expr_parse_binary() and makes expression masks drop duplicate column names. |
| r/sedonadb/R/000-wrappers.R | Adds generated R wrappers for InternalDataFrame$join() and new SedonaDBExpr inspection methods. |
| r/sedonadb/src/rust/src/dataframe.rs | Adds InternalDataFrame::join() using DataFusion join_on() with aliases and parsed JoinType. |
| r/sedonadb/src/rust/src/expression.rs | Exposes qualified_name(), variant_name(), and parse_binary() over FFI for R-side logic. |
| r/sedonadb/src/rust/api.h | Declares new FFI symbols for join and expression inspection. |
| r/sedonadb/src/init.c | Registers new .Call entry points for join + expression inspection. |
| r/sedonadb/NAMESPACE | Exports new join APIs and S3 methods for printing and $ table refs. |
| r/sedonadb/tests/testthat/test-join-expression.R | New test suite for join-by/select evaluation, ambiguity errors, and default projection behavior. |
| r/sedonadb/tests/testthat/_snaps/join-expression.md | Snapshot outputs for join-expression printing and evaluated expressions. |
| r/sedonadb/tests/testthat/test-dataframe.R | Adds an integration-style test ensuring select behavior is applied to join results. |
| r/sedonadb/tests/testthat/test-expression.R | Adds tests for new expression inspection helpers. |
| r/sedonadb/man/sd_join.Rd | New user-facing docs for sd_join(). |
| r/sedonadb/man/sd_join_by.Rd | New user-facing docs for sd_join_by(). |
| r/sedonadb/man/sd_join_select.Rd | New user-facing docs for sd_join_select(). |
| r/sedonadb/man/sd_join_select_default.Rd | New user-facing docs for sd_join_select_default(). |
| r/sedonadb/man/sd_expr_column.Rd | Adds alias/doc entry for sd_expr_parse_binary(). |
| r/sedonadb/man/sd_summarise.Rd | Documents new .env parameter for sd_summarise()/sd_summarize(). |
| r/sedonadb/.Rbuildignore | Ignores local AI assistant marker files. |
| .pre-commit-config.yaml | Excludes testthat snapshot directory from trailing-whitespace hook. |
| .gitignore | Ignores .positai. |


Comment thread r/sedonadb/R/join-expression.R Outdated
Comment thread r/sedonadb/R/join-expression.R
Comment thread r/sedonadb/R/join-expression.R Outdated
Comment thread r/sedonadb/R/join-expression.R Outdated
Comment thread r/sedonadb/R/dataframe.R
Comment thread r/sedonadb/R/dataframe.R Outdated
Comment thread r/sedonadb/tests/testthat/test-dataframe.R
@paleolimbot paleolimbot marked this pull request as ready for review April 27, 2026 15:16
@paleolimbot paleolimbot requested a review from Copilot April 27, 2026 15:16
Copilot AI (Contributor) left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@paleolimbot paleolimbot requested a review from Copilot April 27, 2026 18:15
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 22 out of 23 changed files in this pull request and generated 3 comments.


Comment thread r/sedonadb/R/join-expression.R Outdated
Comment thread r/sedonadb/R/join-expression.R Outdated
Comment thread r/sedonadb/R/join-expression.R
@paleolimbot (Member, Author) commented:

If there are no objections I'll merge this sometime tomorrow!

@paleolimbot paleolimbot merged commit 6110f43 into apache:main Apr 29, 2026
9 checks passed
@paleolimbot paleolimbot deleted the r-join-eval-again branch April 29, 2026 19:51
3 participants