From 00ff13e0ba5ad8ac0ab8843681969fcde7f2831f Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Wed, 22 Feb 2023 07:23:28 -0700 Subject: [PATCH 1/2] DataFusion Substrait blog post --- _posts/2023_02_27_datafusion_substrait.md | 131 ++++++++++++++++++++++ 1 file changed, 131 insertions(+) create mode 100644 _posts/2023_02_27_datafusion_substrait.md diff --git a/_posts/2023_02_27_datafusion_substrait.md b/_posts/2023_02_27_datafusion_substrait.md new file mode 100644 index 000000000000..7778274ed5d6 --- /dev/null +++ b/_posts/2023_02_27_datafusion_substrait.md @@ -0,0 +1,131 @@ +--- +layout: post +title: "DataFusion Now Supports Substrait" +date: "2023-02-27 00:00:00" +author: "pmc" +categories: [arrow] +--- + + + +## Introduction + +The Apache Arrow PMC is pleased to announce that the DataFusion project has accepted the donation of the +datafusion-substrait crate, which was developed by the DataFusion community under the +[datafusion-contrib](https://github.com/datafusion-contrib/) GitHub organization. + +Substrait provides a standardized representation of query plans and expressions. In many ways, the project's goal +is similar to that of the Arrow project. Arrow standardizes the memory representation of columnar data. Substrait +standardizes the representation of operations on data, such as filter and query plans. + +Now that DataFusion can directly run Substrait query plans, there are several exciting new integration possibilities: + +- Pass serialized query plan across language boundaries, such as passing from Python to Rust or Rust to C++. For + , a Python based SQL frontend could pass a Substrait plan to DataFusion which is written in Rust. +- Mixing and matching query engine front-ends and back-ends based on their specific strengths. For example, using + DataFusion for query planning, and Velox for execution, or Calcite for query planning and DataFusion for execution. +- Easier integration for other DataFusion based projects. For example, the related Ballista project, which already + provides “distributed DataFusion” execution plans, serializes query plans using a protobuf format that predates + the Substrait project. By adopting Substrait, Ballista can provide distributed scheduling for query engines + other than DataFusion. + +## Logical Plan Support + +DataFusion currently supports serialization and deserialization of the following logical operators and expressions with Substrait. + +### Operators / Relations + +| DataFusion | SQL | DataFusion Supported Subtypes | +| ------------- |-------------------------| --------------------------------------- | +| Projection | SELECT | | +| TableScan | FROM | | +| Filter | WHERE | | +| Aggregate | GROUP BY | | +| Sort | ORDER BY | | +| Join | JOIN | LEFT, RIGHT, FULL, LEFT ANTI, LEFT SEMI | +| Limit | LIMIT | | +| Distinct | DISTINCT | | +| SubqueryAlias | \ AS \ | | + +### Expressions + +| DataFusion | SQL | DataFusion Supported Subtypes | +| ----------------- |-----------------------------| ------------------------------------------------------------------------------------------------------------- | +| AggregateFunction | | | +| Alias | \ AS \ | | +| Column | \ | | +| BinaryExpr | \ | | +| Between | BETWEEN \ AND \ | | +| Case | CASE ... WHEN ... END | | +| Literal | | Int8, Int16, Int32, Int64, Boolean, Float32, Float64, Decimal128, Utf8, LargeUtf8, Binary, LargeBinary, Date32 | +| Literal Null | | Int8 | Int16 | Int32 | Int64 | Decimal128 | + +## Physical Plan Support + +There is also preliminary work on supporting serialization of physical plans. The tracking issue for this is +[#5173](https://github.com/apache/arrow-datafusion/issues/5173). + +## Python Bindings + +Substrait support is also available from DataFusion’s [Python bindings](https://github.com/apache/arrow-datafusion-python/). + +Source code for this example is available [here](https://github.com/apache/arrow-datafusion-python/blob/main/examples/substrait.py). + +```python +from datafusion import SessionContext +from datafusion import substrait as ss + +# Create a DataFusion context +ctx = SessionContext() + +# Register table with context +ctx.register_parquet('aggregate_test_data', './testing/data/csv/aggregate_test_100.csv') + +substrait_plan = ss.substrait.serde.serialize_to_plan("SELECT * FROM aggregate_test_data", ctx) +# type(substrait_plan) -> + +# Alternative serialization approaches +# type(substrait_bytes) -> , at this point the bytes can be distributed to file, network, etc safely +# where they could subsequently be deserialized on the receiving end. +substrait_bytes = ss.substrait.serde.serialize_bytes("SELECT * FROM aggregate_test_data", ctx) + +# Imagine here bytes would be read from network, file, etc ... for example brevity this is omitted and variable is simply reused +# type(substrait_plan) -> +substrait_plan = ss.substrait.serde.deserialize_bytes(substrait_bytes) + +# type(df_logical_plan) -> +df_logical_plan = ss.substrait.consumer.from_substrait_plan(ctx, substrait_plan) + +# Back to Substrait Plan just for demonstration purposes +# type(substrait_plan) -> +substrait_plan = ss.substrait.producer.to_substrait_plan(df_logical_plan) +``` + +## Availability + +Substrait support is available in DataFusion 18.0.0 and version 0.8.0 of the Python bindings. + +## Get Involved + +The Substrait support is at an early stage of development, and we would welcome more contributors to expand the +functionality and to help with compatibility testing with other data infrastructure that supports Substrait. + +If you are interested in getting involved, an excellent place to start is to read our communication and +contributor guides. From a4585426ddf8d98b5c48b41939680646204aaca9 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Thu, 23 Feb 2023 15:31:36 -0700 Subject: [PATCH 2/2] rename file, add links --- ...rait.md => 2023-02-27-datafusion-substrait.md} | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) rename _posts/{2023_02_27_datafusion_substrait.md => 2023-02-27-datafusion-substrait.md} (92%) diff --git a/_posts/2023_02_27_datafusion_substrait.md b/_posts/2023-02-27-datafusion-substrait.md similarity index 92% rename from _posts/2023_02_27_datafusion_substrait.md rename to _posts/2023-02-27-datafusion-substrait.md index 7778274ed5d6..d541eb2a401d 100644 --- a/_posts/2023_02_27_datafusion_substrait.md +++ b/_posts/2023-02-27-datafusion-substrait.md @@ -53,7 +53,7 @@ DataFusion currently supports serialization and deserialization of the following ### Operators / Relations | DataFusion | SQL | DataFusion Supported Subtypes | -| ------------- |-------------------------| --------------------------------------- | +| ------------- | ----------------------- | --------------------------------------- | | Projection | SELECT | | | TableScan | FROM | | | Filter | WHERE | | @@ -67,7 +67,7 @@ DataFusion currently supports serialization and deserialization of the following ### Expressions | DataFusion | SQL | DataFusion Supported Subtypes | -| ----------------- |-----------------------------| ------------------------------------------------------------------------------------------------------------- | +| ----------------- | --------------------------- | -------------------------------------------------------------------------------------------------------------- | ----- | ----- | ----- | ---------- | | AggregateFunction | | | | Alias | \ AS \ | | | Column | \ | | @@ -79,12 +79,12 @@ DataFusion currently supports serialization and deserialization of the following ## Physical Plan Support -There is also preliminary work on supporting serialization of physical plans. The tracking issue for this is +There is also preliminary work on supporting serialization of physical plans. The tracking issue for this is [#5173](https://github.com/apache/arrow-datafusion/issues/5173). ## Python Bindings -Substrait support is also available from DataFusion’s [Python bindings](https://github.com/apache/arrow-datafusion-python/). +Substrait support is also available from DataFusion’s [Python bindings](https://github.com/apache/arrow-datafusion-python/). Source code for this example is available [here](https://github.com/apache/arrow-datafusion-python/blob/main/examples/substrait.py). @@ -124,8 +124,9 @@ Substrait support is available in DataFusion 18.0.0 and version 0.8.0 of the Pyt ## Get Involved -The Substrait support is at an early stage of development, and we would welcome more contributors to expand the +The Substrait support is at an early stage of development, and we would welcome more contributors to expand the functionality and to help with compatibility testing with other data infrastructure that supports Substrait. -If you are interested in getting involved, an excellent place to start is to read our communication and -contributor guides. +If you are interested in getting involved, an excellent place to start is to read our +[communication](https://arrow.apache.org/datafusion/contributor-guide/communication.html) and +[contributor](https://arrow.apache.org/datafusion/contributor-guide/index.html) guides.