@yutannihilation
Contributor

@yutannihilation yutannihilation commented Nov 4, 2025

Part of #126

This pull request implements the SRIDifiedKernel wrapper, which is suggested in #126, and applies it to ST_Point() to see if it works.

Currently, it seems to work. I need to figure out how to test this because ScalarUdfTester doesn't return SedonaType.

> sd_sql("select st_srid(st_point(1, 1, 4326))")
┌──────────────────────────────────────────────────┐
│ st_srid(st_point(Int64(1),Int64(1),Int64(4326))) │
│                      uint32                      │
╞══════════════════════════════════════════════════╡
│                                             4326 │
└──────────────────────────────────────────────────┘
Preview of up to 6 row(s)

Member

@paleolimbot paleolimbot left a comment

Awesome!

In Python you should be able to do:

result = eng.execute_and_collect("<sql>")
df = eng.result_to_pandas(result)
geopandas.testing.assert_geodataframe_equal(df, expected)

(In theory result_to_pandas is CRS-aware, even for PostGIS)

Comment on lines 263 to 265
// TODO: This branch is not really the "invalid CRS value" case.
// If it can be cast to Utf-8, it falls into the first branch.
return sedona_internal_err!("Invalid CRS value");
Member

Maybe "Can't cast Crs {crs:?} to Utf8"?

],
);

tester.assert_return_type(WKB_GEOMETRY);
Member

It's a great point that the UDF tester doesn't propagate CRSes because it doesn't consider scalar arguments. We could add return_type_with_with_scalars() to the ScalarUdfTester?

@yutannihilation
Contributor Author

yutannihilation commented Nov 7, 2025

We could add return_type_with_with_scalars() to the ScalarUdfTester?

Thanks for the hint! I tried it, but currently I'm seeing this error. I hope it's just that my implementation is wrong, but does ScalarUdfTester::return_type() go through a different code path than actually invoking the function...?

called `Result::unwrap()` on an `Err` value: NotImplemented("st_point([Arrow(Float64), Arrow(Float64), Arrow(UInt16)]): No kernel matching arguments")

edit: answer to self. return_field_from_args() calls return_type(), not return_type_from_args_and_scalars().

https://github.com/apache/datafusion/blob/f32984b2dbf9e5a193c20643ce624167295fbd61/datafusion/expr/src/udf.rs#L628-L637

Member

@paleolimbot paleolimbot left a comment

edit: answer to self. return_field_from_args() calls return_type(), not return_type_from_args_and_scalars().

I think you're looking at the default trait implementation...I'm pretty sure the SedonaScalarFunction/SedonaKernel handles that correctly. I think that the return_type_from_args_and_scalars() here is returning None (I can't spot exactly why...stepping through that in the debugger might help).

Comment on lines 243 to 250
fn return_type_from_args_and_scalars(
    &self,
    args: &[SedonaType],
    scalar_args: &[Option<&ScalarValue>],
) -> Result<Option<SedonaType>> {
    let orig_args_len = args.len() - 1;
    let orig_args = &args[..orig_args_len];
    let orig_scalar_args = &scalar_args[..orig_args_len];
Member

I think we may need to check the number of arguments here (return None if it's not the expected number) to avoid a panic
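
A minimal sketch of that length check, assuming a generic slice in place of the PR's actual SedonaType arguments. The helper name split_crs_arg is hypothetical and not part of SedonaDB; the point is only that slice::split_last() returns None on an empty slice instead of underflowing on len() - 1:

```rust
/// Hypothetical helper (not from the PR): splits `[arg0, ..., crs_arg]`
/// into the original arguments and the trailing CRS argument.
/// Returns `None` for an empty slice instead of panicking on `len() - 1`.
fn split_crs_arg<T>(args: &[T]) -> Option<(&[T], &T)> {
    // `split_last` yields (last element, everything before it).
    let (crs_arg, orig_args) = args.split_last()?;
    Some((orig_args, crs_arg))
}
```

A caller could then write `let Some((orig_args, crs_arg)) = split_crs_arg(args) else { return Ok(None) };` before delegating to the inner kernel.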

@yutannihilation
Contributor Author

I'm pretty sure the SedonaScalarFunction/SedonaKernel handles that correctly.

Yes. It seems the problem was that ScalarUdfTester holds ScalarUDF, not SedonaScalarUDF.

@paleolimbot
Member

ScalarUdfTester holds ScalarUDF, not SedonaScalarUDF.

The ScalarUDF is what DataFusion will interact with and calling the trait functions should still give correct results 😬

@yutannihilation
Contributor Author

Ah, thanks. Got it at last...

@yutannihilation
Contributor Author

yutannihilation commented Nov 8, 2025

Okay, I think I figured it out. I see two problems. (Sorry, I was a bit confused. You are right that the functions around return_type() themselves work correctly.)

First, ScalarUdfTester::invoke() calls ScalarUdfTester::return_type(), which doesn't consider scalar arguments. So, we need some escape hatch to specify the return type calculated from the scalar arguments. Fortunately, we have the actual scalar arguments in these invoke_*() variants, so this is easy to fix, although the code looks a bit complicated.

return_field: self.return_type()?.to_storage_field("", true)?.into(),

Second, ScalarUdfTester::assert_scalar_result_equals() also calls return_type(). In this case, we have no way to infer the result type, so it probably needs to be provided from outside.

let return_type = self.return_type().unwrap();

@paleolimbot paleolimbot marked this pull request as ready for review November 8, 2025 02:32
@paleolimbot paleolimbot marked this pull request as draft November 8, 2025 02:32
@yutannihilation yutannihilation marked this pull request as ready for review November 8, 2025 02:58
@yutannihilation
Contributor Author

I think this is ready for review now, but with one caveat about the differences from PostGIS.

First, it seems PostGIS treats a NULL CRS and an unknown (0) CRS differently. I'm not sure where in SedonaDB we should make changes to match this behavior, so I decided not to address it in this pull request and commented out the failing test with a note explaining why.

postgres=# SELECT ST_SRID(ST_POINT(1, 1));
 st_srid
---------
       0
(1 row)

postgres=# SELECT ST_SRID(ST_POINT(1, 1, null));
 st_srid
---------

(1 row)
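
The three cases above can be modeled in a few lines. This is only a sketch of the PostGIS behavior shown in the psql output, not SedonaDB code; the SridArg enum and the postgis_like_srid function are hypothetical names introduced for illustration:

```rust
/// Hypothetical model of the optional third argument to ST_Point().
enum SridArg {
    /// ST_Point(x, y): no SRID argument at all.
    Omitted,
    /// ST_Point(x, y, NULL).
    Null,
    /// ST_Point(x, y, <srid>).
    Value(u32),
}

/// What ST_SRID() reports for each case, following the psql output above.
/// `None` stands for SQL NULL.
fn postgis_like_srid(arg: SridArg) -> Option<u32> {
    match arg {
        // An omitted SRID means "unknown", which PostGIS reports as 0.
        SridArg::Omitted => Some(0),
        // A NULL SRID makes the whole point NULL, so ST_SRID() is NULL too.
        SridArg::Null => None,
        // An explicit SRID (including 0) is reported as-is.
        SridArg::Value(srid) => Some(srid),
    }
}
```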

Also, another difference I found is that PostGIS doesn't accept a CRS with an authority (e.g., EPSG:4326), while SedonaDB does. I don't think this difference is a problem, but I'm not sure whether I should include it in the test cases.

Member

@paleolimbot paleolimbot left a comment

Also, another difference I found is that PostGIS doesn't accept a CRS with an authority (e.g., EPSG:4326), while SedonaDB does. I don't think this difference is a problem, but I'm not sure whether I should include it in the test cases.

We added ST_SetCrs() for this case to keep ST_SRID() more similar, although I think that the convenience of ST_Point(1, 2, '<string>') will be worth the slight digression from PostGIS...we also allow this in ST_Transform().

Comment on lines 1321 to 1327
# TODO: This is a bit tricky, but in PostGIS, NULL and unknown CRS are distinguished.
#
# - ST_SRID(ST_POINT(x, y, NULL)) returns NULL
# - ST_SRID(ST_POINT(x, y, 0)) returns 0
# - ST_SRID(ST_POINT(x, y)) returns 0
#
# (1, 1, None, None),
Member

I think that ST_SetSRID() handles this in the same way as PostGIS and it would be helpful to handle that here (I'll leave a suggestion below about how we might do that)

Comment on lines 296 to 304
fn invoke_batch(
    &self,
    arg_types: &[SedonaType],
    args: &[ColumnarValue],
) -> Result<ColumnarValue> {
    let orig_args_len = arg_types.len() - 1;
    self.inner
        .invoke_batch(&arg_types[..orig_args_len], &args[..orig_args_len])
}
Member

I think this is the place where we'd have to check if let ColumnarValue::Scalar(sc) = args[orig_args_len] { if sc.is_null(), and perhaps modify the validity buffer of the inner.invoke_batch().to_array(). Feel free to punt on that and file a follow-on ticket 🙂

Contributor Author

Ahh, thanks! I didn't notice ST_POINT(x, y, NULL) should return NULL... I'll fix.

@yutannihilation
Contributor Author

perhaps modify the validity buffer of the inner.invoke_batch().to_array()

I tried this approach, but I couldn't find any API that exposes the actual buffer as mutable. So, I chose a different way that skips invoke_batch().

@yutannihilation
Contributor Author

(Just a side note)
After I commented above, I started to wonder if it's really fine to skip invoking the actual logic. For example, should ST_GeomFromText() reject invalid WKT inputs when the SRID is NULL? But, it seems PostGIS also skips any checks, so it should be fine.

postgres=# SELECT ST_GeomFromText('point (1 1', null);
 st_geomfromtext 
-----------------

(1 row)

postgres=# SELECT ST_GeomFromText('point (1 1');
ERROR:  parse error - invalid geometry
HINT:  "point (1 1" <-- parse error at position 12 within geometry

Member

@paleolimbot paleolimbot left a comment

Thank you!

I think you're right about propagating the error and there's one small potential improvement; however, this is great and those are unlikely corner cases I'm happy to punt into a future follow-on when we have time.

Comment on lines 308 to 310
// If the specified SRID is NULL, the result is also NULL. So, return
// NULL early and doesn't run `invoke_batch()`.
if let ColumnarValue::Scalar(sc) = &args[orig_args_len] {
Member

I think your last point is a good one...we should probably invoke_batch first no matter what (to propagate any errors). I don't think that anybody is relying on the performance of returning a column full of nulls because they probably made a mistake 🙂
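
A rough sketch of that ordering, under heavy simplification: the result column is modeled as a Vec<Option<String>> rather than an Arrow array, the wrapped kernel is a plain closure, and invoke_then_mask is a hypothetical name. It is not the code merged in this PR, and it deliberately differs from the PostGIS output above, where a NULL SRID hides the parse error:

```rust
/// Hypothetical sketch: run the wrapped kernel first so its errors
/// propagate, then replace every row with NULL if the SRID is NULL.
/// `None` models SQL NULL for both the SRID and the result rows.
fn invoke_then_mask(
    inner: impl FnOnce() -> Result<Vec<Option<String>>, String>,
    srid: Option<u32>,
) -> Result<Vec<Option<String>>, String> {
    // Any error from the inner kernel (e.g. invalid WKT) surfaces here,
    // even when the SRID argument is NULL.
    let rows = inner()?;
    match srid {
        // Non-NULL SRID: keep the inner result unchanged.
        Some(_) => Ok(rows),
        // NULL SRID: same number of rows, all NULL.
        None => Ok(vec![None; rows.len()]),
    }
}
```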

Comment on lines +251 to +264
// args should consist of the original args and one extra arg for
// specifying CRS. So, first, validate the length and separate these.
//
// [arg0, arg1, ..., crs_arg];
//  ^^^^^^^^^^^^^^^
//     orig_args
let orig_args_len = match (args.len(), scalar_args.len()) {
    (0, 0) => return Ok(None),
    (l1, l2) if l1 == l2 => l1 - 1,
    _ => return sedona_internal_err!("Arg types and arg values have different lengths"),
};

let orig_args = &args[..orig_args_len];
let orig_scalar_args = &scalar_args[..orig_args_len];
Member

I think I understand this better now...this works, although probably if args.len() == 0 { return Ok(None) } is sufficient (there are a lot of places in our code where we rely on DataFusion passing us the right number of things). I was worried before that if somebody passed (e.g.,) a single argument to ST_Point() something funny would happen here, but I see now that the call to the inner.return_type_from_args_and_scalars() will correctly return Ok(None) for that case.

Totally optional, but the error message for something like ST_Point('gazornenplat') would probably be better if you moved the call to inner.return_type_from_args_and_scalars() before the CRS parsing.
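
To illustrate the ordering being suggested, here is a toy sketch in which argument types are plain strings. Every name in it (inner_return_type, parse_crs, return_type) is a hypothetical stand-in, and the "EPSG:<code>" parser is far narrower than whatever SedonaDB actually accepts. Matching the inner kernel before parsing the CRS means a call like ST_Point('gazornenplat') falls through to "no matching kernel" rather than producing a CRS error:

```rust
/// Hypothetical stand-in for the wrapped kernel's type matching:
/// accepts exactly two float arguments, as ST_Point(x, y) would.
fn inner_return_type(orig_args: &[&str]) -> Option<&'static str> {
    match orig_args {
        ["float64", "float64"] => Some("geometry"),
        _ => None,
    }
}

/// Hypothetical CRS parser that only understands "EPSG:<code>".
fn parse_crs(crs: &str) -> Result<u32, String> {
    crs.strip_prefix("EPSG:")
        .and_then(|code| code.parse::<u32>().ok())
        .ok_or_else(|| format!("Invalid CRS value: {crs}"))
}

fn return_type(args: &[&str], crs_value: &str) -> Result<Option<String>, String> {
    // No arguments at all: this kernel does not apply.
    let Some((_crs_type, orig_args)) = args.split_last() else {
        return Ok(None);
    };
    // Ask the inner kernel first; if it does not match, report "no match"
    // without ever looking at the CRS value.
    let Some(inner) = inner_return_type(orig_args) else {
        return Ok(None);
    };
    // Only now parse the CRS, so CRS errors appear only for calls that
    // otherwise have the right shape.
    let srid = parse_crs(crs_value)?;
    Ok(Some(format!("{inner}(EPSG:{srid})")))
}
```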

Contributor Author

Thanks, sounds good to me.

@yutannihilation
Contributor Author

Done! I hope I got what you meant.

Member

@paleolimbot paleolimbot left a comment

Thank you!

A quick note that hopefully in the next week or so we'll have item-level CRSes (i.e., for the "srid is an array" case we'll have a separate return type). I don't think that's much code change on top of this but thought I'd put it out there in case it affects anything you're working on!

@paleolimbot paleolimbot merged commit 943d149 into apache:main Nov 9, 2025
12 checks passed
@yutannihilation
Contributor Author

Thanks, I saw the issue about item-level CRSes and was wondering how it relates to here. Looking forward to seeing how it is implemented!

@yutannihilation
Contributor Author

Just curious. I think this hasn't happened yet. Was it that you were simply too busy (I guess releasing is a tough job!), or did you run into some technical difficulty?

A quick note that hopefully in the next week or so we'll have item-level CRSes

@paleolimbot
Member

😬

I just didn't get there (focused on file IO for 0.2.0). The issue for this is #136 (I'll add some background to that on vaguely how I think it will work)

@yutannihilation
Contributor Author

Thanks, good to know!
