Add Spark compatibility mode using datafusion-spark expressions #1397

@andygrove

Description
Summary

Add a configuration option to enable Spark-compatible expression behavior by registering functions from the datafusion-spark crate. This would help users migrating from Spark get more consistent behavior without requiring a full Spark Connect implementation.

Motivation

Ballista aims to be a compelling alternative to Apache Spark. While full Spark Connect protocol support is being addressed by other projects such as LakeSail's Sail, there is a simpler improvement that would help Spark users: ensuring expression and function behavior matches Spark semantics.

The datafusion-spark crate (version 51.0.0, maintained alongside DataFusion) provides:

  • Spark-compatible scalar functions
  • Spark-compatible aggregate functions
  • Spark-compatible window functions
  • Spark-compatible table functions

These functions implement Spark's specific semantics, which can differ from DataFusion's defaults (e.g., null handling, type coercion, edge cases).

Proposed Solution

New Configuration Option

Add a new Ballista configuration key:

pub const BALLISTA_SPARK_COMPAT_MODE: &str = "ballista.spark_compat_mode";

With the config entry:

ConfigEntry::new(
    BALLISTA_SPARK_COMPAT_MODE.to_string(),
    "Enable Spark compatibility mode which registers Spark-compatible expressions from datafusion-spark".to_string(),
    DataType::Boolean,
    Some("false".to_string())
)

Implementation

When ballista.spark_compat_mode is enabled:

  1. Scheduler side: Register datafusion-spark functions when creating the SessionContext
  2. Executor side: Ensure the same functions are available during plan execution

use datafusion_spark::register_all;

if config.spark_compat_mode() {
    register_all(&mut ctx)?;
}

Feature Flag

Add an optional feature to ballista-core and ballista-scheduler:

[features]
spark-compat = ["datafusion-spark"]

[dependencies]
datafusion-spark = { version = "51", optional = true }

This keeps the dependency optional for users who don't need Spark compatibility.

Usage

CLI

ballista-scheduler --spark-compat-mode
ballista-executor --spark-compat-mode

Environment Variable

BALLISTA_SPARK_COMPAT_MODE=true ballista-scheduler

Programmatic

let config = BallistaConfig::builder()
    .set(BALLISTA_SPARK_COMPAT_MODE, "true")
    .build()?;

Benefits

  1. Low effort, high value: Leverages existing datafusion-spark crate
  2. Incremental migration path: Users can test Spark compatibility without full commitment
  3. Transparent: Clear config flag makes behavior explicit
  4. Optional: Feature-flagged to avoid bloating builds for users who don't need it

Future Extensions

This could be extended to include:

  • Spark SQL dialect parsing (when available in DataFusion)
  • Additional Spark-specific behaviors (null ordering, case sensitivity)
  • Integration with datafusion-comet-spark-expr for even more compatibility
