Conversation
The last I recall thinking about this was summarized in this comment. The functions, at least as seen in other DBs or query engines, are very similar, with extract being slightly more powerful by allowing one to define which group to extract. Frankly, I could see DataFusion having one function that does both (aliased to regexp_substr and regexp_extract) where an optional 'index' or 'group' can be provided (defaulting to 0) that denotes which capture group to return.
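For illustration, the unified behavior being proposed might look like this (hypothetical syntax, not current DataFusion behavior; group 0 conventionally means the whole match):

```sql
SELECT regexp_extract('2024-06-01', '(\d+)-(\d+)-(\d+)');    -- defaults to group 0: '2024-06-01'
SELECT regexp_extract('2024-06-01', '(\d+)-(\d+)-(\d+)', 2); -- capture group 2: '06'
```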
| | 200 |
| +---------------------------------------------------------+
| ```
| Additional examples can be found [here](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/builtin_functions/regexp.rs)
If we are not going to add examples to the regexp.rs file, I would suggest removing this line.
| argument(name = "str", description = "Column or column name"),
| argument(
|     name = "regexp",
|     description = r#"a string representing a regular expression. The regex string should be a
If this is indeed the case (Java regex semantics), this function belongs in the spark crate, not in the main datafusion functions crate.
| ) -> Result<ColumnarValue> {
|     let args = &args.args;
|
|     if args.len() != 2 && args.len() != 3 {
I'm not sure how this could possibly work. If args.len() == 2 it'll fail the first condition; if 3, the second.

If it's neither 2 nor 3, then it's an error.
So, if len == 2, it'll fail the check on 2, hence won't enter the branch.
Maybe written as follows it could read easier?

```diff
- if args.len() != 2 && args.len() != 3 {
+ if !(args.len() == 2 || args.len() == 3) {
```
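As an editor-added aside, Rust's `matches!` macro offers a third equivalent spelling of the same arity check:

```rust
// Equivalent to the two variants above; `2 | 3` is an or-pattern.
if !matches!(args.len(), 2 | 3) {
    // return the wrong-argument-count error here
}
```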
From what I remember, it was quite complicated to expose Rust-backed regexp into the JVM world because of Rust/JVM regex processing differences.
Theoretically we can still expose the function, but Spark users would need to be careful and accept the nuances, and this needs to be documented.
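One concrete instance of the Rust/JVM divergence mentioned above: the Rust `regex` crate rejects look-around, which Java's `java.util.regex` supports. A minimal sketch, assuming the `regex` crate:

```rust
// The `regex` crate returns a compile error for look-ahead patterns,
// while the same pattern is valid in Java's java.util.regex.
fn main() {
    assert!(regex::Regex::new(r"foo(?=bar)").is_err());
}
```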
Jefffrey left a comment:
Sounds like we should proceed with adding this as a function, given other DBs/engines have something similar; however, we should probably approach this from the angle of adding it as a DataFusion function rather than necessarily matching Spark exactly, given what @comphead outlined.
| /// Extracts a group that matches `regexp`. If `idx` is not specified,
| /// it defaults to 1.
| ///
| /// Matches Spark's DataFrame API: `regexp_extract(e: Column, exp: String, groupIdx: Int)`
We probably should remove mention of Spark since we're adding this as a DataFusion function (i.e. not to the datafusion-spark crate)
| use std::any::Any;
| use std::sync::Arc;
|
| // See https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.regexp_extract.html
| }
|
| fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
|     use DataType::*;
We don't need all these checks in return_type; we can simply return Ok(Utf8), as the signature should guard this for us.
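A minimal sketch of that simplification, assuming the function's signature already constrains the argument types:

```rust
fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
    // The signature has already validated the argument types,
    // so no per-type checks are needed here.
    Ok(DataType::Utf8)
}
```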
| }
|
| // DataFusion passes either scalars or arrays. Convert to arrays.
| let len = args
We should just use make_scalar_function, which handles this boilerplate for us, if we don't want to deal with ColumnarValues.
Thanks, I'll try to look into it!
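A rough sketch of that wiring, assuming a UDF struct named `RegexpExtract` and a kernel `regexp_extract_inner` (the exact `make_scalar_function` import path and signature vary between DataFusion versions, so treat this as illustrative):

```rust
// Inside `impl ScalarUDFImpl for RegexpExtract`:
fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    // make_scalar_function normalizes scalar arguments to arrays
    // before invoking the inner kernel.
    make_scalar_function(regexp_extract_inner, vec![])(&args.args)
}

// The kernel then only ever sees arrays:
fn regexp_extract_inner(args: &[ArrayRef]) -> Result<ArrayRef> {
    todo!("extraction logic")
}
```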
| }
|
| /// Helper to build args for tests and external callers.
| pub fn regexp_extract(args: &[ArrayRef]) -> Result<ArrayRef> {
This doesn't need to be public
|
| /// Helper to build args for tests and external callers.
| pub fn regexp_extract(args: &[ArrayRef]) -> Result<ArrayRef> {
|     if args.len() != 3 {
If it needs 3 arguments we should make the signature take 3 distinct arguments instead of a slice.

There's a small omission here (it should be either 2 or 3). If there's a desire to always require 3, I can change it everywhere, but it'll make it diverge slightly from Spark (where the group idx is optional and defaults to 1 when not specified).
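If the group index stays optional, one way to get distinct, typed parameters while preserving Spark's default of 1 could be the following sketch (names and types are assumptions, not from the PR):

```rust
use arrow::array::{ArrayRef, Int64Array, StringArray};
use datafusion_common::Result;

// Distinct, typed parameters with an optional group index that
// defaults to capture group 1, matching Spark's behavior.
fn regexp_extract_inner(
    values: &StringArray,
    pattern: &StringArray,
    group_idx: Option<&Int64Array>,
) -> Result<ArrayRef> {
    let _default_group = 1i64; // used when group_idx is None
    todo!("extraction kernel")
}
```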
| }
|
| #[cfg(test)]
| mod tests {
Could we move all these tests to be SLTs instead?
I was following the existing pattern in this mod.
My $0.02: having unit tests written next to the definition helps future evolution (I, for one, find step-by-step debugging much more efficient).
If there's a strong desire to remove them, I can (but all the other unit tests should also be removed in order to be consistent).
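For reference, a sqllogictest case for this function might look roughly like the following (file location and expected output are illustrative):

```
# e.g. datafusion/sqllogictest/test_files/regexp.slt
query T
SELECT regexp_extract('100-200', '(\d+)-(\d+)', 1);
----
100
```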
| let pattern = &args[1];
| let index = &args[2];
|
| let values_array = values
Can use as_string_array for easier downcasting here; same idea for the int array below too.
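A small sketch of that downcasting, using DataFusion's cast helpers and assuming the group index arrives as Int64:

```rust
use arrow::array::ArrayRef;
use datafusion_common::cast::{as_int64_array, as_string_array};
use datafusion_common::Result;

// Downcast the three argument arrays without manual downcast_ref chains.
fn downcast_args(args: &[ArrayRef]) -> Result<()> {
    let _values = as_string_array(&args[0])?;
    let _pattern = as_string_array(&args[1])?;
    let _index = as_int64_array(&args[2])?;
    Ok(())
}
```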
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
What changes are NOT included in this PR?
Are these changes tested?
Are there any user-facing changes?
Yes (new regex function added to the docs).