Description
Add a native ADBC (Arrow Database Connectivity) data source to Spark, similar in spirit to the existing JDBC data source but built on the Arrow-native ADBC API.
ADBC is a database connectivity API standard under the Apache Arrow project. It provides a vendor-neutral, columnar alternative to JDBC/ODBC, designed specifically for analytical workloads. ADBC drivers return result sets as streams of Arrow data rather than row by row, which eliminates expensive row-to-columnar conversions. Since Spark itself is row-based, the effect is not as dramatic, but it is still noticeable.
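To make the "row-to-columnar conversion" concrete, here is a minimal sketch using only Python's standard library. It does not use ADBC or Arrow: the built-in `sqlite3` cursor stands in for a row-oriented (JDBC-style) driver, and the helper names (`rows_to_columns`, `fetch_column_batches`, `demo`) are hypothetical, chosen just for illustration. The transpose in `rows_to_columns` is the per-batch work a consumer has to do when the driver hands back rows; an ADBC driver would hand back column batches directly.

```python
import sqlite3


def rows_to_columns(rows, names):
    """Transpose a batch of row tuples into one list per column.

    This is the row-to-columnar step a consumer pays for when the
    driver only exposes a row-oriented cursor.
    """
    return {name: list(col) for name, col in zip(names, zip(*rows))}


def fetch_column_batches(conn, sql, batch_size=2):
    """Yield column-oriented batches built from a row-oriented cursor."""
    cur = conn.execute(sql)
    names = [d[0] for d in cur.description]
    while True:
        rows = cur.fetchmany(batch_size)  # row-by-row style fetch
        if not rows:
            break
        yield rows_to_columns(rows, names)


def demo():
    """Build a tiny in-memory table and read it back as column batches."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
    )
    batches = list(
        fetch_column_batches(conn, "SELECT id, name FROM t ORDER BY id")
    )
    conn.close()
    return batches
```

Calling `demo()` yields two batches, one holding the first two rows and one holding the last, each as a dict of column lists. With a columnar driver the transpose step disappears, because the batches already arrive in that shape.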
Why (now):
- There are mature native drivers for PostgreSQL, SQLite, DuckDB, Flight SQL, Snowflake, BigQuery, MySQL, SQL Server, Databricks and so on. They are also very easy to install (and locate) on a system with the dbc CLI tool.
- There is now good support for invoking ADBC from Java via JNI bindings to the C++ ADBC driver manager (see blog). This makes it practical to integrate ADBC into Spark's JVM-based architecture. Technically, drivers can be implemented in Java as well, but the quality of the Java implementations is pretty low, so realistically one will almost always use a native driver.
- ADBC fits well with Spark's columnar read support in Data Source V2. Generating ArrowColumnVectors from ADBC is pretty straightforward. It can also benefit external Spark accelerators like Comet and (presumably) Photon.
I have a proof-of-concept implementation at spark-adbc that demonstrates the basic read path, along with some not-so-scientific benchmarks against JDBC. I'm willing to incrementally implement ADBC data source support upstream if there's interest from the community.