Ballista should not have separate DataFrame implementation #2

andygrove · 2021-04-18T18:44:14Z

When building the Ballista POC, it was necessary to implement a new DataFrame API that wrapped the DataFusion API.

One issue was that the behavior of the `collect` method could not be overridden to make it use the Ballista context rather than the DataFusion context.

Now that the projects are in the same repo, it should be easier to fix this and have users always use the DataFusion DataFrame API.
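To illustrate the idea, here is a minimal sketch of one way a single DataFrame type could have its `collect` behavior supplied by the context that created it. This is not the DataFusion or Ballista API: `Plan`, `Executor`, `LocalExecutor`, `DistributedExecutor`, the scheduler address, and `run_example` are all hypothetical names made up for this example, and plans are represented as plain strings.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a logical plan; not a DataFusion type.
#[derive(Debug, Clone)]
struct Plan(String);

/// Hypothetical execution backend: each context decides how a plan is run.
trait Executor {
    fn execute(&self, plan: &Plan) -> Vec<String>;
}

/// Runs the plan in-process (plays the role of the DataFusion context).
struct LocalExecutor;

impl Executor for LocalExecutor {
    fn execute(&self, plan: &Plan) -> Vec<String> {
        vec![format!("local: {}", plan.0)]
    }
}

/// Pretends to submit the plan to a cluster (plays the role of the Ballista context).
struct DistributedExecutor {
    scheduler: String, // illustrative placeholder address
}

impl Executor for DistributedExecutor {
    fn execute(&self, plan: &Plan) -> Vec<String> {
        vec![format!("remote via {}: {}", self.scheduler, plan.0)]
    }
}

/// A single DataFrame type shared by both backends.
struct DataFrame {
    plan: Plan,
    executor: Arc<dyn Executor>,
}

impl DataFrame {
    fn new(plan: &str, executor: Arc<dyn Executor>) -> Self {
        Self { plan: Plan(plan.to_string()), executor }
    }

    /// Builds a new plan; the executor is carried along unchanged.
    fn filter(&self, predicate: &str) -> Self {
        Self {
            plan: Plan(format!("Filter({}) <- {}", predicate, self.plan.0)),
            executor: Arc::clone(&self.executor),
        }
    }

    /// `collect` delegates to whichever executor created this DataFrame.
    fn collect(&self) -> Vec<String> {
        self.executor.execute(&self.plan)
    }
}

/// Builds the same query against both backends and collects the results.
fn run_example() -> (Vec<String>, Vec<String>) {
    let local = DataFrame::new("Scan(t)", Arc::new(LocalExecutor));
    let remote = DataFrame::new(
        "Scan(t)",
        Arc::new(DistributedExecutor { scheduler: "localhost:50050".to_string() }),
    );
    (local.filter("a > 1").collect(), remote.filter("a > 1").collect())
}
```

In this sketch the user-facing code is identical for both backends; only the executor handed to the DataFrame differs, which is the property a wrapper DataFrame was previously needed to provide.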

andygrove added the ballista label Apr 18, 2021

andygrove changed the title ~~[Ballista] Ballista should not have separate DataFrame implementation~~ Ballista should not have separate DataFrame implementation Apr 20, 2021

andygrove mentioned this issue Apr 24, 2021

Remove Ballista DataFrame #48

Merged

andygrove self-assigned this Apr 24, 2021

andygrove closed this as completed in #48 Apr 25, 2021

Dandandan mentioned this issue May 13, 2021

Vectorized hashing for hash aggregation code #336

Closed

yjshen referenced this issue in yjshen/datafusion Sep 16, 2021

Make scalar.rs compile (#2)

00df64a

* wip * more * Make scalar.rs compile

Igosuki added a commit to Igosuki/arrow-datafusion that referenced this issue Jan 14, 2022

Fix tests apache#2

5cee11d

EricJoy2048 referenced this issue in argoengine/arrow-datafusion Feb 23, 2022

First first submit (#2)

ed4a5ba

Initial commit

EricJoy2048 added a commit to EricJoy2048/arrow-datafusion that referenced this issue Mar 2, 2022

Improve UDF plugin code (apache#2)

7570fd3

Improve UDF plugin code

andygrove mentioned this issue May 23, 2022

Move logical optimizer rules out of the core datafusion crate #2599

Closed

1 task

Dandandan pushed a commit that referenced this issue Sep 27, 2022

Optimize regex_replace for scalar patterns (#3614)

15c19c3

* Optimize `regex_replace` for scalar patterns * Change the hot-path on `regexp_replace` to only variadic source (#2)

This was referenced Jan 13, 2024

dataframe.with_column_rename has unintuitive behavior when using case sensitive column names #8800

Closed

support to_timestamp with optional chrono formats #8886

Merged

alamb mentioned this issue Feb 6, 2024

Finalize SIGMOD 2024 paper ~(if accepted)~ #8373

Closed

5 tasks

jayzhan211 mentioned this issue Apr 24, 2024

Move create_physical_expr to phy-expr-common #3 #10188

Closed

Omega359 mentioned this issue Jun 5, 2024

Feedback request for providing configurable UDF functions #10744

Closed
