Skip to content

Conversation

@ovr
Copy link
Contributor

@ovr ovr commented Apr 10, 2022

Closes: #2207

Hello!

I've opened a PR as a Draft to indicate that I am working on the task of supporting the array index operator in DF.

Breaking changes

  • It's started to allow using negative number indexes as PostgreSQL does.
  • Use 1-based indexes as PostgreSQL does.

image

Thanks

)),
ColumnarValue::Scalar(scalar) => match (scalar.get_datatype(), &self.key) {
(DataType::List(v), ScalarValue::Int64(Some(i))) => {
let wrapper = scalar.to_array();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I implement a new method that will convert Vec (which ScalarValue holds) for List instead of using the current API and temporarily ArrayRef? For example: as_array_list()?

Thanks

cC @alamb

Copy link
Contributor

@alamb alamb Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if initially you might be able to use ScalarValue::to_array_of_size to convert the scalar argument into an ArrayRef and then use the same code as above:

Another alternative might be to use the take kernel with something like

Copy link
Contributor Author

@ovr ovr Apr 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_array uses to_array_of_size under the hood, but I cannot reuse the code above, because it returns an ColumnarValue::Array, but in the case with ColumnarValue::Scalar we need to return ColumnarValue::Scalar.

I did another draft in af8c77f WDYT?

Thanks

Copy link
Contributor Author

@ovr ovr May 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb Sorry for the ping, just a friendly reminder that I am still waiting for advice on how to solve it. Thank you!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry @ovr -- taking a look

@ovr ovr force-pushed the scalar-array-index branch from c58afcf to af8c77f Compare April 28, 2022 21:53
@ovr ovr changed the title feat: Support ArrayIndex for ScalarValue(List) feat: Support GetIndexedFieldExpr for ScalarValue Apr 28, 2022

#[tokio::test]
async fn test_array_index() -> Result<()> {
test_expression!("([5,4,3,2,1])[1]", "5");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.postgresql.org/docs/current/arrays.html

I verified that the subscripts are 1 based 👍

The array subscript numbers are written within square brackets. By default PostgreSQL uses a one-based numbering convention for arrays, that is, an array of n elements starts with array[1] and ends with array[n].

test_expression!("([5,4,3,2,1])[1]", "5");
test_expression!("([5,4,3,2,1])[5]", "1");
test_expression!("([5,4,3,2,1])[100]", "NULL");
test_expression!("([5,4,3,2,1])[-1]", "NULL");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you want to potentially try nested lists. Something like

Suggested change
test_expression!("([5,4,3,2,1])[-1]", "NULL");
test_expression!("([5,4,3,2,1])[-1]", "NULL");
test_expression!("([[123],[4,5,6]])[2]", "[4,5,6]");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, it's not possible to define multi-dimension arrays via SQL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, Nope, I was wrong.

ColumnarValue::Scalar(_) => Err(DataFusionError::NotImplemented(
"field access is not yet implemented for scalar values".to_string(),
)),
ColumnarValue::Scalar(scalar) => match (scalar.get_datatype(), &self.key) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I fully follow this code -- Since it is creating a ArrayRef from the ColumnarValue::Scalar, I wonder why it can't use the same code as the ColumnarValue::Array case and call to_arrow()?

https://github.com/cube-js/arrow-datafusion/blob/scalar-array-index/ballista/rust/client/src/columnar_batch.rs#L150

So for example, rather than

    fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
        let arg = self.arg.evaluate(batch)?;
        match arg {
            ColumnarValue::Array(array) => match (array.data_type(), &self.key) {
...

It could look like:

    fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
        let array = self.arg.evaluate(batch)?
          // convert to Arrayref
          .to_arrow();

        match (array.data_type(), &self.key) {
...

That way the same code would be used for the array and scalar cases of ColumnarValue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Done.

@ovr ovr force-pushed the scalar-array-index branch from af8c77f to e053234 Compare June 2, 2022 16:18
@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates labels Jun 2, 2022
} else {
Ok(Field::new(&i.to_string(), lt.data_type().clone(), false))
}
Ok(Field::new(&i.to_string(), lt.data_type().clone(), true))
Copy link
Contributor Author

@ovr ovr Jun 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the check for negative integers to be compatible with PostgreSQL.

image

@ovr ovr marked this pull request as ready for review June 2, 2022 16:45
@ovr
Copy link
Contributor Author

ovr commented Jun 2, 2022

Rebased ✅ Changed to simplify/reuse logic across Scalar/ArrayRef. cC @alamb

@andygrove andygrove self-requested a review June 3, 2022 13:10
Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @ovr.

@andygrove andygrove changed the title feat: Support GetIndexedFieldExpr for ScalarValue Support GetIndexedFieldExpr for ScalarValue Jun 3, 2022
@andygrove
Copy link
Member

I'll go ahead and merge this later today unless @alamb has any additional feedback

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thanks @ovr

@alamb alamb merged commit a11b79e into apache:master Jun 3, 2022
@ovr ovr deleted the scalar-array-index branch June 3, 2022 16:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support ArrayIndex for ScalarValue(List)

3 participants