Description
What is the problem the feature request solves?
Note: This issue was generated with AI assistance. The specification details have been extracted from Spark documentation and may need verification.
Comet does not currently support the Spark get_json_object function, causing queries using this function to fall back to Spark's JVM execution instead of running natively on DataFusion.
The GetJsonObject expression extracts JSON values from a JSON string using JSONPath expressions. It takes a JSON string and a JSONPath query, returning the matching value(s) as a string representation.
Supporting this expression would allow more Spark workloads to benefit from Comet's native acceleration.
Describe the potential solution
Spark Specification
Syntax:
get_json_object(json_string, path)
// DataFrame API usage
get_json_object(col("json_column"), "$.field")
// or using expr()
expr("get_json_object(json_column, '$.field')")
Arguments:
| Argument | Type | Description |
|---|---|---|
| json | Expression | The JSON string to query against |
| path | Expression | The JSONPath expression to extract values |
Return Type: Returns StringType - the extracted JSON value as a string representation.
Supported Data Types:
- json parameter: String type containing valid JSON
- path parameter: String type containing valid JSONPath expressions
Edge Cases:
- Null handling: Returns null if either json or path parameters are null
- Invalid JSON: Returns null for malformed JSON input strings
- Invalid JSONPath: Returns null for syntactically incorrect JSONPath expressions
- No matches: Returns null when the JSONPath doesn't match any elements
- Multiple matches: When the path matches several elements (for example via `[*]`), returns them as a JSON array string
Examples:
-- Extract simple field
SELECT get_json_object('{"name":"John","age":30}', '$.name');
-- Result: John (string values are returned without surrounding quotes)
-- Extract from array
SELECT get_json_object('[{"a":"b"},{"a":"c"}]', '$[*].a');
-- Result: ["b","c"]
-- Extract nested field
SELECT get_json_object('{"user":{"profile":{"name":"Alice"}}}', '$.user.profile.name');
-- Result: Alice
-- Array element access
SELECT get_json_object('{"items":["apple","banana","cherry"]}', '$.items[1]');
-- Result: banana
// DataFrame API usage
import org.apache.spark.sql.functions._
df.select(get_json_object(col("json_data"), "$.field"))
// Using expr for complex JSONPath
df.select(expr("get_json_object(json_column, '$[*].nested.field')"))
// Multiple extractions
df.select(
get_json_object(col("json_data"), "$.name").alias("name"),
get_json_object(col("json_data"), "$.age").alias("age")
)
Implementation Approach
See the Comet guide on adding new expressions for detailed instructions.
- Scala Serde: Add expression handler in spark/src/main/scala/org/apache/comet/serde/
- Register: Add to appropriate map in QueryPlanSerde.scala
- Protobuf: Add message type in native/proto/src/proto/expr.proto if needed
- Rust: Implement in native/spark-expr/src/ (check if DataFusion has built-in support first)
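For the Rust step, one possible starting point is turning the path argument into a list of steps before walking the JSON value. The sketch below is only an illustration under stated assumptions: it handles just the path forms used in the examples in this issue (`$`, `.field`, `[index]`, `[*]`), and `PathStep` and `parse_path` are hypothetical names, not existing Comet or DataFusion APIs. Spark's actual path grammar may accept more forms, so this needs checking against `GetJsonObject` before being relied on.

```rust
/// One step of a parsed path. Hypothetical type for illustration only;
/// not part of Comet or DataFusion.
#[derive(Debug, PartialEq)]
enum PathStep {
    Field(String), // `.name`
    Index(usize),  // `[1]`
    Wildcard,      // `[*]`
}

/// Parses paths such as `$.user.profile.name`, `$.items[1]`, or `$[*].a`.
/// Returns `None` for anything outside this small subset; a caller would
/// map that to SQL NULL, matching the "Invalid JSONPath" edge case.
fn parse_path(path: &str) -> Option<Vec<PathStep>> {
    // Every path must start with the root marker `$`.
    let mut rest = path.strip_prefix('$')?;
    let mut steps = Vec::new();

    while !rest.is_empty() {
        if let Some(after_dot) = rest.strip_prefix('.') {
            // A field name runs until the next `.` or `[`, or to the end.
            let end = after_dot
                .find(|c: char| c == '.' || c == '[')
                .unwrap_or(after_dot.len());
            if end == 0 {
                return None; // empty field name, e.g. `$.` or `$..a`
            }
            steps.push(PathStep::Field(after_dot[..end].to_string()));
            rest = &after_dot[end..];
        } else if let Some(after_bracket) = rest.strip_prefix('[') {
            // A bracket step must be closed and contain `*` or an index.
            let close = after_bracket.find(']')?;
            let inner = &after_bracket[..close];
            if inner == "*" {
                steps.push(PathStep::Wildcard);
            } else {
                steps.push(PathStep::Index(inner.parse::<usize>().ok()?));
            }
            rest = &after_bracket[close + 1..];
        } else {
            return None; // unexpected character after `$` or after a step
        }
    }
    Some(steps)
}
```

With this sketch, `parse_path("$.items[1]")` yields a `Field("items")` step followed by `Index(1)`, while a path without the leading `$`, such as `items[1]`, yields `None`. If the path argument is a literal, parsing it once per expression rather than once per row is probably worthwhile.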
Additional context
Difficulty: Large
Spark Expression Class: org.apache.spark.sql.catalyst.expressions.GetJsonObject
Related:
- json_tuple - Extract multiple JSON fields in a single operation
- from_json - Parse JSON string into structured data types
- to_json - Convert structured data to JSON strings
- json_array_length - Get length of JSON arrays
This issue was auto-generated from Spark reference documentation.