Summary
Add a configuration option to enable Spark-compatible expression behavior by registering functions from the datafusion-spark crate. This would help users migrating from Spark get more consistent behavior without requiring a full Spark Connect implementation.
Motivation
Ballista aims to be a compelling alternative to Apache Spark. While full Spark Connect protocol support is being addressed by other projects such as LakeSail's Sail, there is a simpler improvement that would help Spark users: ensuring that expression and function behavior matches Spark semantics.
The datafusion-spark crate (version 51.0.0, maintained alongside DataFusion) provides:
- Spark-compatible scalar functions
- Spark-compatible aggregate functions
- Spark-compatible window functions
- Spark-compatible table functions
These functions implement Spark's specific semantics, which can differ from DataFusion's defaults (e.g., in null handling, type coercion, and edge-case behavior).
Proposed Solution
New Configuration Option
Add a new Ballista configuration key:
```rust
pub const BALLISTA_SPARK_COMPAT_MODE: &str = "ballista.spark_compat_mode";
```

With the config entry:

```rust
ConfigEntry::new(
    BALLISTA_SPARK_COMPAT_MODE.to_string(),
    "Enable Spark compatibility mode which registers Spark-compatible expressions from datafusion-spark".to_string(),
    DataType::Boolean,
    Some("false".to_string()),
)
```

Implementation
When ballista.spark_compat_mode is enabled:
- Scheduler side: Register datafusion-spark functions when creating the SessionContext
- Executor side: Ensure the same functions are available during plan execution
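As a self-contained sketch of the plumbing (the `Config` type and the `spark_compat_mode()` accessor are hypothetical stand-ins for the real BallistaConfig), the string-valued setting could back a typed boolean accessor that falls back to the documented default:

```rust
use std::collections::HashMap;

pub const BALLISTA_SPARK_COMPAT_MODE: &str = "ballista.spark_compat_mode";

// Minimal stand-in for the Ballista config: settings are stored as
// strings, and a typed accessor applies the documented default ("false")
// when the key is absent or unparseable.
struct Config {
    settings: HashMap<String, String>,
}

impl Config {
    // Hypothetical accessor; the real BallistaConfig would derive the
    // default from its ConfigEntry rather than hard-coding it here.
    fn spark_compat_mode(&self) -> bool {
        self.settings
            .get(BALLISTA_SPARK_COMPAT_MODE)
            .and_then(|v| v.parse::<bool>().ok())
            .unwrap_or(false)
    }
}

fn main() {
    // Default: the flag is off.
    let off = Config { settings: HashMap::new() };
    assert!(!off.spark_compat_mode());

    // Explicitly enabled.
    let mut settings = HashMap::new();
    settings.insert(BALLISTA_SPARK_COMPAT_MODE.to_string(), "true".to_string());
    let on = Config { settings };
    assert!(on.spark_compat_mode());
    println!("spark_compat_mode = {}", on.spark_compat_mode());
}
```

Both scheduler and executor would consult this accessor at session-creation time, so the same registration decision is made on both sides.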
```rust
use datafusion_spark::register_all;

if config.spark_compat_mode() {
    register_all(&mut ctx)?;
}
```

Feature Flag
Add an optional feature to ballista-core and ballista-scheduler:
```toml
[features]
spark-compat = ["datafusion-spark"]

[dependencies]
datafusion-spark = { version = "51", optional = true }
```

This keeps the dependency optional for users who don't need Spark compatibility.
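With the feature flag in place, the registration call could also be compiled out entirely when the feature is off. A minimal sketch of that gating pattern (the function name is hypothetical, and `SessionContext` is stubbed here so the example compiles standalone; in Ballista it would be DataFusion's `SessionContext`):

```rust
// Stub for datafusion::prelude::SessionContext, used only so this
// sketch is self-contained.
struct SessionContext;

#[cfg(feature = "spark-compat")]
fn maybe_register_spark_functions(ctx: &mut SessionContext) -> Result<(), String> {
    // Only compiled (and linked against datafusion-spark) when the
    // `spark-compat` feature is enabled.
    datafusion_spark::register_all(ctx).map_err(|e| e.to_string())
}

#[cfg(not(feature = "spark-compat"))]
fn maybe_register_spark_functions(_ctx: &mut SessionContext) -> Result<(), String> {
    // Feature disabled: a no-op, so callers need not care whether
    // datafusion-spark is linked at all.
    Ok(())
}

fn main() {
    let mut ctx = SessionContext;
    assert!(maybe_register_spark_functions(&mut ctx).is_ok());
    println!("registration path ok");
}
```

Callers then invoke `maybe_register_spark_functions` unconditionally; the runtime config flag would still gate the call inside the feature-enabled branch.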
Usage
CLI
```shell
ballista-scheduler --spark-compat-mode
ballista-executor --spark-compat-mode
```

Environment Variable

```shell
BALLISTA_SPARK_COMPAT_MODE=true ballista-scheduler
```

Programmatic
```rust
let config = BallistaConfig::builder()
    .set(BALLISTA_SPARK_COMPAT_MODE, "true")
    .build()?;
```

Benefits
- Low effort, high value: Leverages existing datafusion-spark crate
- Incremental migration path: Users can test Spark compatibility without full commitment
- Transparent: Clear config flag makes behavior explicit
- Optional: Feature-flagged to avoid bloating builds for users who don't need it
Future Extensions
This could be extended to include:
- Spark SQL dialect parsing (when available in DataFusion)
- Additional Spark-specific behaviors (null ordering, case sensitivity)
- Integration with datafusion-comet-spark-expr for even more compatibility
References
- datafusion-spark crate
- DataFusion Spark Functions docs
- datafusion-comet-spark-expr (alternative/complementary)
- Ballista Spark Connect discussion: Is it possible to support Spark Connect? #964