Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexed field access for List #1006

Merged
merged 15 commits into from
Oct 29, 2021
Merged

Indexed field access for List #1006

merged 15 commits into from
Oct 29, 2021

Conversation

Igosuki
Copy link
Contributor

@Igosuki Igosuki commented Sep 15, 2021

Which issue does this PR close?

Closes #1005

Rationale for this change

Supporting accessing nested values is a great improvement as it makes everything more flexible since input data doesn't need to be flat, and so it doesn't need to be transformed by ETL prior to being ingested by datafusion.

What changes are included in this PR?

Nested value access for Lists and Dictionary

Are there any user-facing changes?

users will now be able to do : list[0][0] and dict['foo']['bar']

No breaking changes.

I need to test the dictionary access I've only used the list one.

@github-actions github-actions bot added ballista datafusion Changes in the datafusion crate sql SQL Planner labels Sep 15, 2021
@Igosuki
Copy link
Contributor Author

Igosuki commented Sep 15, 2021

I had to patch sqlparser to make this work, in order for sqlparser to support numbers as field selector.

@houqp
Copy link
Member

houqp commented Sep 16, 2021

@Igosuki could you send your sqlparser-rs patch to upstream? We can help review release a new version to crates.io.

@houqp houqp added the enhancement New feature or request label Sep 16, 2021
@Igosuki
Copy link
Contributor Author

Igosuki commented Sep 16, 2021

@alamb
Copy link
Contributor

alamb commented Sep 17, 2021

I will try and get sql parser change merged in and a new release created something this weekend

@alamb
Copy link
Contributor

alamb commented Sep 17, 2021

This PR (that we never completed) from @jorgecarleitao has related work I think -- namely access to struct fields: #628

@Igosuki
Copy link
Contributor Author

Igosuki commented Sep 17, 2021 via email

@Igosuki
Copy link
Contributor Author

Igosuki commented Sep 20, 2021

Will get back to this soon once I sort out the related issue here apache/datafusion-sqlparser-rs#356

@Igosuki
Copy link
Contributor Author

Igosuki commented Sep 24, 2021

I updated sqlparser for arbitrary nested access. But I'm having trouble understanding how select on dict should behave for instance with this e2c00f1#diff-6a157f1b320a619e7a3896027ce7d47a8e3ffeeae8ac3d21fe9157c92ebc2a6bL4955

@alamb
Copy link
Contributor

alamb commented Sep 24, 2021

I am preparing a sqlparser release FWIW: apache/datafusion-sqlparser-rs#360

@houqp
Copy link
Member

houqp commented Oct 9, 2021

@Igosuki do you need any help finishing up this PR?

@Igosuki
Copy link
Contributor Author

Igosuki commented Oct 10, 2021

@houqp Indeed, I mean I could just remove the dict lookup and do it in another PR so we could just get array access here.
The problem is here https://github.com/apache/arrow-datafusion/pull/1006/files#diff-08a2204dc38facfa0757eed25f99d9047fcd3b44bd03df02c89f6f71afacefefR143 and here https://github.com/apache/arrow-datafusion/pull/1006/files#diff-6a157f1b320a619e7a3896027ce7d47a8e3ffeeae8ac3d21fe9157c92ebc2a6bR5218
If there I have a column, each row may contain a dict, and so if I do a reverse lookup, then it should return a value for each non-null row right, but instead the dict is treated as a single instance.

@houqp
Copy link
Member

houqp commented Oct 10, 2021

I think there might be some confusions here. In my mind, col["a"]["b"] should access col as a dictionary in the sense of hashmap, not Arrow dictionary encoded array. In other words, when index key is a string, I think we should be matching for Struct datatype instead of dictionary type.

BTW, I am perfectly fine with us splitting the PRs into two and handle the dictionary access as a follow up if you prefer to do so 👍

@Igosuki
Copy link
Contributor Author

Igosuki commented Oct 12, 2021

@houqp Ok, it's what I had in mind as well.
Although now I am reminded that dict lookup is pretty useful to join lookup data with queries at runtime and limit the amount of data stored, so this code will still be useful, just not in this way.

@Igosuki Igosuki force-pushed the map_access_i branch 2 times, most recently from 4b7eabc to 48cad8f Compare October 12, 2021 08:42
@Igosuki
Copy link
Contributor Author

Igosuki commented Oct 12, 2021

Removed dictionary lookup so it should be gtg now

@Igosuki
Copy link
Contributor Author

Igosuki commented Oct 12, 2021

needs to rerun at on sha 48cad8f

@@ -245,6 +246,13 @@ pub enum Expr {
IsNull(Box<Expr>),
/// arithmetic negation of an expression, the operand must be of a signed numeric data type
Negative(Box<Expr>),
/// Returns the field of a [`ListArray`] or ['DictionaryArray'] by name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by string name or integer indices?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Igosuki I think made a good point here, would be better to mention both index and name. On top of this, I think DictionaryArray should be removed here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns the field of a [`ListArray`] or ['DictionaryArray'] by name
/// Returns the field of a [`ListArray`] by name

let iter = concat(vec.as_slice()).unwrap();
Ok(ColumnarValue::Array(iter))
}
_ => Err(DataFusionError::NotImplemented(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using other and provide type name is more helpful

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Igosuki what do you think about the comment here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh guess I overlooked this, will fix

@Igosuki
Copy link
Contributor Author

Igosuki commented Oct 26, 2021

Agree with @jimexist 's comment, the rest looks good to me. @Igosuki for the follow up struct field access support PR, you can follow the physical plan evaluation implementation that @jorgecarleitao started awhile back at: https://github.com/apache/arrow-datafusion/pull/628/files#diff-ae63212535ac34ea66a26dc4726715b594b05794376b1b1fb511ff32211ed576R75. I think your SQL syntax combined with that physical plan implementation should be solid +1

I'll get to it as soon as this is merged

@Igosuki
Copy link
Contributor Author

Igosuki commented Oct 26, 2021

I addressed the concerns in the comments, things should be gtg

@alamb
Copy link
Contributor

alamb commented Oct 27, 2021

FYI I have this PR on my list to review, I just haven't had a chance to to do so yet -- will try and get it done tomorrow

@Igosuki
Copy link
Contributor Author

Igosuki commented Oct 28, 2021

No worries, I've been slow as well and obviously there's a lot going on on the project. When it gets validated I'll do the follow up for struct array access.

@houqp houqp requested a review from jimexist October 28, 2021 15:04
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is looking good personally -- thank you @Igosuki !

Let us know if you would like some help finishing up the comments

@@ -245,6 +246,13 @@ pub enum Expr {
IsNull(Box<Expr>),
/// arithmetic negation of an expression, the operand must be of a signed numeric data type
Negative(Box<Expr>),
/// Returns the field of a [`ListArray`] or ['DictionaryArray'] by name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns the field of a [`ListArray`] or ['DictionaryArray'] by name
/// Returns the field of a [`ListArray`] by name

@@ -231,6 +231,7 @@ pub mod variable;
pub use arrow;
pub use parquet;

pub mod field_util;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub mod field_util;
pub (crate) mod field_util;

Unless there is some reason to expose this module, I think it might be nice to leave it internal so that we can move it around as needed

// specific language governing permissions and limitations
// under the License.

//! get field of a struct array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! get field of a struct array
//! get field of a `ListArray`

))
}
_ => Err(DataFusionError::Plan(
"The expression to get an indexed field is only valid for `List` or 'Dictionary'"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"The expression to get an indexed field is only valid for `List` or 'Dictionary'"
"The expression to get an indexed field is only valid for `List` types"

Ok(ColumnarValue::Array(iter))
}
_ => Err(DataFusionError::NotImplemented(
"get indexed field is only possible on lists".to_string(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Untested:

Suggested change
"get indexed field is only possible on lists".to_string(),
format!("get indexed field is only possible on lists with int64 indexes. Tried {} with {} index", array.data_type(), self.key())

)),
},
ColumnarValue::Scalar(_) => Err(DataFusionError::NotImplemented(
"field is not yet implemented for scalar values".to_string(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"field is not yet implemented for scalar values".to_string(),
"field access is not yet implemented for scalar values".to_string(),

let arg = self.arg.evaluate(batch)?;
match arg {
ColumnarValue::Array(array) => match (array.data_type(), &self.key) {
(DataType::List(_), ScalarValue::Int64(Some(i))) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if self.key is null, I think we could also return null

So perhaps we could add a new case like this (untested):

(DataType::List(_), _) if self.key.is_null=> { 
  let scalar_null: ScalarValue = array.data_type().try_into()?;
  Ok(ColumnarValue::Scalar(scalar_null))
}

Copy link
Contributor Author

@Igosuki Igosuki Oct 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also the issue of an empty list which doesn't get through concat

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good opportunity for follow up work / PRs

@@ -141,6 +143,10 @@ fn create_physical_name(e: &Expr, is_first_expr: bool) -> Result<String> {
let expr = create_physical_name(expr, false)?;
Ok(format!("{} IS NOT NULL", expr))
}
Expr::GetIndexedField { expr, key } => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Igosuki
Copy link
Contributor Author

Igosuki commented Oct 29, 2021

@alamb Added some test coverage in get_indexed_field.rs let me know if you agree with the changes

@alamb
Copy link
Contributor

alamb commented Oct 29, 2021

Thanks for sticking with it @Igosuki !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support nested list and map (dictionary) access
4 participants