
Error "entered unreachable code: NamedStructField should be rewritten in OperatorToFunction" after upgrade to 37 #10181

Closed
alamb opened this issue Apr 22, 2024 · 19 comments · Fixed by #10330
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Apr 22, 2024

NOTE -- Here is an example of how to make Expr::NamedStructField work in 37.1.0: #10183

Is your feature request related to a problem or challenge?

In 37.0.0 many of the built-in functions were migrated to UDFs, as described in #8045. The migration is completed in 38.0.0.

One part of this change is that certain Exprs must now be rewritten into the appropriate functions -- most notably get_field, which extracts a field from a Struct.

Among other things, this allows people to customize how the Expr behaves: #7845 (comment), or (as discussed on Slack) returning NULLs for keys not present in maps.

The rewrite happens automatically as part of the logical planner (in the Analyzer pass). However, if you bypass those passes, it will not happen.

Yeah you need to use the FunctionRewriter here (with the relevant rewriter registered) https://github.com/apache/arrow-datafusion/blob/0573f78c7e7a4d94c3204cee464b3860479e0afb/datafusion/optimizer/src/analyzer/function_rewrite.rs#L33
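As a sketch of what "with the relevant rewriter registered" can look like in practice (API names are from the datafusion 37-era crates and may differ across versions):

```rust
// Sketch: if you build LogicalPlans yourself (bypassing the SQL planner),
// run the SessionState's analyzer/optimizer over the plan first so the
// registered function rewrites turn NamedStructField into a `get_field`
// call. `SessionState::optimize` runs the Analyzer passes (including
// ApplyFunctionRewrites) followed by the logical optimizer rules.
use datafusion::error::Result;
use datafusion::execution::context::SessionContext;
use datafusion::logical_expr::LogicalPlan;

fn rewrite_before_physical_planning(plan: LogicalPlan) -> Result<LogicalPlan> {
    let state = SessionContext::new().state();
    state.optimize(&plan)
}
```

The rewritten plan can then be handed to the physical planner without hitting the unreachable-code error.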

Example

An example from Discord:

    let schema = Schema::new(vec![
        Field::new("id", DataType::Utf8, true),
        Field::new(
            "props",
            DataType::Struct(Fields::from(vec![Field::new("a", DataType::Utf8, true)])),
            true,
        ),
    ]);

    println!("schema {:?}", schema);

    let df_schema = DFSchema::try_from(schema.clone()).unwrap();

    let plan = table_scan(Some("props_test"), &schema, None)?
        .filter(col("props").field("a").eq(lit("2021-02-02")))?
        .build()?;
    println!("logical plan {:?}", plan);

    let phys = DefaultPhysicalPlanner::default().create_physical_expr(
        &plan.expressions()[0],
        &df_schema,
        &SessionContext::new().state(),
    )?;
    println!("phys {:?}", phys);
    Ok(())

This returns an error "NamedStructField should be rewritten in OperatorToFunction"

Describe the solution you'd like

No response

Describe alternatives you've considered

One potential workaround is to call get_field directly rather than Expr::field

So instead of

    let plan = table_scan(Some("props_test"), &schema, None)?
        .filter(col("props").field("a").eq(lit("2021-02-02")))?
        .build()?;

call like

    let plan = table_scan(Some("props_test"), &schema, None)?
        .filter(get_field(col("props"), "a").eq(lit("2021-02-02")))?
        .build()?;

Additional context

@ion-elgreco is seeing the same issue in Delta-rs: #9904 (comment)

I tried it with 37.1.0 in delta-rs, but we still get this error: internal error: entered unreachable code: NamedStructField should be rewritten in OperatorToFunction, wasn't this regression fixed?

@westonpace brings it up in Discord

Another report in Discord

@alamb alamb added the enhancement New feature or request label Apr 22, 2024
@alamb
Contributor Author

alamb commented Apr 22, 2024

Example from @ion-elgreco

@alamb this is the code:

  let (table, _metrics) = DeltaOps(table)
            .delete()
            .with_predicate("props['a'] = '2021-02-02'")
            .await
            .unwrap();

Which comes from here: https://github.com/delta-io/delta-rs/blob/main/crates%2Fcore%2Fsrc%2Foperations%2Fdelete.rs#L770-L774

@alamb
Contributor Author

alamb commented Apr 22, 2024

I see a few options:

  1. We can put get_field back into the core crate (and move it out of datafusion-functions) -- this might be the least surprising, but it would not allow people to customize how field access works (e.g. for implementing JSON support)
  2. We can make a better API / more examples of how to get the core rewrites working in:

    fn evaluate_demo() -> Result<()> {
        // For example, let's say you have some integers in an array
        let batch = RecordBatch::try_from_iter([(
            "a",
            Arc::new(Int32Array::from(vec![4, 5, 6, 7, 8, 7, 4])) as _,
        )])?;
        // If you want to find all rows where the expression `a < 5 OR a = 8` is true
        let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8)));
        // First, you make a "physical expression" from the logical `Expr`
        let physical_expr = physical_expr(&batch.schema(), expr)?;
        // Now, you can evaluate the expression against the RecordBatch
        let result = physical_expr.evaluate(&batch)?;
        // The result contains an array that is true only where `a < 5 OR a = 8`
        let expected_result = Arc::new(BooleanArray::from(vec![
            true, false, false, false, true, false, true,
        ])) as _;
        assert!(
            matches!(&result, ColumnarValue::Array(r) if r == &expected_result),
            "result: {:?}",
            result
        );
        Ok(())
    }

I would be happy to work on 2 if someone could point me at how people are creating Physical exprs today

@westonpace
Member

I think I'd be happy with 2. The example you linked is how we are using datafusion. Here is an updated example that fails with the error:

    // Build a struct column `a` with a single Int32 child field `b`
    let b = Arc::new(Int32Array::from(vec![4, 5, 6, 7, 8, 7, 4]));
    let a = Arc::new(StructArray::new(
        vec![Field::new("b", DataType::Int32, false)].into(),
        vec![b as _],
        None,
    ));
    let batch = RecordBatch::try_from_iter([("a", a as _)])?;

    // If you want to find all rows where the expression `a.b < 5 OR a.b = 8` is true
    let expr = col("a")
        .field("b")
        .lt(lit(5))
        .or(col("a").field("b").eq(lit(8)));

    // First, you make a "physical expression" from the logical `Expr`
    let physical_expr = physical_expr(&batch.schema(), expr)?;

@ion-elgreco

@alamb this is how we create expressions:

/// Parse a string predicate into an `Expr`
pub(crate) fn parse_predicate_expression(
    schema: &DFSchema,
    expr: impl AsRef<str>,
    df_state: &SessionState,
) -> DeltaResult<Expr> {
    let dialect = &GenericDialect {};
    let mut tokenizer = Tokenizer::new(dialect, expr.as_ref());
    let tokens = tokenizer
        .tokenize()
        .map_err(|err| DeltaTableError::GenericError {
            source: Box::new(err),
        })?;
    let sql = Parser::new(dialect)
        .with_tokens(tokens)
        .parse_expr()
        .map_err(|err| DeltaTableError::GenericError {
            source: Box::new(err),
        })?;

    let context_provider = DeltaContextProvider { state: df_state };
    let sql_to_rel =
        SqlToRel::new_with_options(&context_provider, DeltaParserOptions::default().into());

    Ok(sql_to_rel.sql_to_expr(sql, schema, &mut Default::default())?)
}

@alamb
Contributor Author

alamb commented Apr 22, 2024

I'll work on creating an example shortly

@alamb alamb self-assigned this Apr 22, 2024
@westonpace
Member

westonpace commented Apr 22, 2024

This also leads to a sort of "variant problem" for any code we write that handles Expr. For example, we have code that walks through an Expr and calculates which (potentially nested) columns are referenced. The "Expr walking" code now has to be aware of both the GetStructField and the ScalarUDF variants of field access.

It would be nicer I think if there was a single canonical way to represent a nested field access in Expr. For example, maybe the translation from GetStructField to ScalarUDF happens as part of the translation from Expr to PhysicalExpr? This way Expr only has GetStructField but there is still the flexibility to customize field access?

@alamb
Contributor Author

alamb commented Apr 22, 2024

Here is an example of how to make Expr::struct work in 37.1.0: #10183

I think we need a better API to do this for real (in 38.0.0 and going forward). I will think about this -- maybe @jayzhan211 has some thoughts

@alamb
Contributor Author

alamb commented Apr 22, 2024

The "Expr walking" code now has to be aware of both the GetStructField and the ScalarUDF variants of field access.

I think this can be controlled by the consumer -- for example, if you are walking Exprs in lancedb, you can control when you transform Expr::GetStructField into ScalarUDF, and depending on where you do your analysis you only have to check for one
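For concreteness, a hypothetical sketch of that consumer-side normalization (the variant shapes and the `get_field` helper are from the datafusion 37-era APIs, and the TreeNode API has changed between versions, so the exact signatures may differ):

```rust
// Hypothetical sketch: normalize field access so downstream Expr-walking
// code only has to handle the ScalarUDF representation.
use datafusion::common::tree_node::{Transformed, TreeNode};
use datafusion::common::Result;
use datafusion::functions::core::expr_fn::get_field;
use datafusion::logical_expr::expr::{GetFieldAccess, GetIndexedField};
use datafusion::logical_expr::{lit, Expr};

fn normalize_field_access(expr: Expr) -> Result<Expr> {
    expr.transform(&|e| {
        Ok(match e {
            // Rewrite `expr["name"]` into the `get_field` scalar UDF call
            Expr::GetIndexedField(GetIndexedField {
                expr,
                field: GetFieldAccess::NamedStructField { name },
            }) => Transformed::yes(get_field(*expr, lit(name))),
            other => Transformed::no(other),
        })
    })
    .map(|t| t.data)
}
```

After running this, analysis code only needs to match the ScalarUDF form.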

It would be nicer I think if there was a single canonical way to represent a nested field access in Expr.

I think the idea is people might want to override Expr::GetStructField's semantics, and the way they would do so is to rewrite it into a different function. I think this is especially compelling for supporting JSON/JSONB, for example

@westonpace
Member

I think this can be controlled by the consumer -- for example if you are walking Exprs in lancedb, you can control when you transform Expr::GetStructField into ScalarUDF and depending on where you do your analysis you only have to check for one

I agree (and thanks for the example). This works for us. All of our exprs from the user start as &str. However, if we were receiving Expr directly then it wouldn't work, because we wouldn't know which approach the user used. This is not a problem for us; we aren't expecting the user to provide DF structs directly in any of our roadmap features, I'm just thinking through possibilities.

I think the idea is people might want to override Expr::GetStructField's semantics, and the way they would do so is to rewrite it into a different function. I think this is especially compelling for supporting JSON/JSONB, for example

What's the motivation for doing this at the logical level instead of doing this as part of the conversion from logical to physical?

@alamb
Contributor Author

alamb commented Apr 22, 2024

What's the motivation for doing this at the logical level instead of doing this as part of the conversion from logical to physical?

I don't think there is any particular motivation (or any reason that the conversion needs to be done at either spot) 🤔

@westonpace
Member

I don't think there is any particular motivation (or any reason that the conversion needs to be done at either spot) 🤔

I think, for me, it's just a general unease with having multiple ways of expressing the same thing. I feel like this can lead to "implicit layers" of the plan. For example, there is already some notion of "parse plan", "unoptimized logical plan", "optimized logical plan", and "physical plan". The middle two are both represented by Expr, which can be subtle. Do we now add "rewritten logical plan" to the list? Or maybe "rewritten" and "simplified" are just very transient states between "unoptimized" and "optimized" and I am blowing things out of proportion.

Another way to tackle it could be to leave the concept of a GetIndexedField node at the parsing layer and pull it out of Expr (or deprecate). This would force the conversion to be done between the parse plan and the logical plan.

That being said, my needs are met (thanks again for your help), and perfect is the enemy of the good, so I'm happy to leave well enough alone.

@jayzhan211
Contributor

jayzhan211 commented Apr 23, 2024

done between the parse plan and the logical plan

I had also thought about deprecating the Expr variant and using functions directly in the parsing phase. I think it might be a good idea. The downside of the current impl is that one needs to register a function rewrite rule to convert the Expr to a function, and registering should not be a necessary step for the default behavior. If we have functions in the parsing phase, no additional step (registering the rewrite) is needed.

What's the motivation for doing this at the logical level instead of doing this as part of the conversion from logical to physical?

One good reason is that we don't need to care about physical exprs if we convert to functions at the logical level. I think the earlier we optimize, the less duplication we leave behind. If we move this rewrite to the logical-to-physical step, we need to map it to some kind of physical expr, which does not seem better.

I think the idea is people might want to override Expr::GetStructFields semantics and they way they would do so is to rewrite it into a different function. I think this is especially compelling for supporting JSON/JSONB for example

If we have functions after parsing, people can rewrite functions to their expected ones by registering their own rewrite rules, so I think that is also not a problem

Do we now add "rewritten logical plan" to the list?

"rewritten / simplified logical plan" is actually "optimized logical plan" to me, "optimized" is more general term.

I think we need a better API to do this for real (in 38.0.0 and going forward). I will think about this -- maybe @jayzhan211 has some thoughts

I think we are talking about a better API design for get_field, right? I think we can take a cue from others: DuckDB has struct_extract and map_extract.
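For reference, those DuckDB functions look roughly like this (a sketch from memory of DuckDB's nested functions; exact semantics are worth double-checking against its docs):

```sql
-- struct_extract pulls a single field out of a STRUCT value
SELECT struct_extract({'a': 1, 'b': 2}, 'a');        -- 1

-- map_extract returns the list of values stored under a key
SELECT map_extract(MAP([100, 5], ['a', 'b']), 100);  -- ['a']
```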

@alamb
Contributor Author

alamb commented Apr 23, 2024

Another way to tackle it could be to leave the concept of a GetIndexedField node at the parsing layer and pull it out of Expr (or deprecate). This would force the conversion to be done between the parse plan and the logical plan.

I agree this would be clearest (basically remove Expr::GetIndexedField and always use a function).

I think we are talking about better API design of get_field, right?

I was actually thinking about an API for the use case of "I created an Expr programmatically and I want to convert it to a PhysicalExpr I can execute." We do this in InfluxDB IOx. I didn't realize that both LanceDB and Delta did the same thing.

Currently there is an example of how to do this in expr_api.rs:

/// Build a physical expression from a logical one, after applying simplification and type coercion
pub fn physical_expr(schema: &Schema, expr: Expr) -> Result<Arc<dyn PhysicalExpr>> {

Maybe it is time to "promote" that function to a real API in datafusion-core somewhere where it can be tested and better documented.

I think we can take a cue from others: DuckDB has struct_extract and map_extract

Those are interesting functions -- now that we have the notion of function packages, adding better support for the Map datatype would be sweet.

@alamb
Contributor Author

alamb commented Apr 23, 2024

I am thinking I'll try and make a PR with such an API over the next day or two to see how it might look

@ion-elgreco

ion-elgreco commented Apr 25, 2024

@alamb just checking in, is there something we need to refactor? Or are you simplifying the API and automatically handling this within datafusion on all code paths?

@alamb
Contributor Author

alamb commented Apr 25, 2024

@alamb just checking in, is there something we need to refactor? Or are you simplifying the API and automatically handling this within datafusion on all code paths?

I suggest for the time being you use the pattern here (is this possible)?

Here is an example of how to make Expr::struct work in 37.1.0: #10183

For the next release (38.0.0) I was thinking we would move the create_physical_expr function into the core (rather than an example).

Thoughts?

@ion-elgreco

@alamb just checking in, is there something we need to refactor? Or are you simplifying the API and automatically handling this within datafusion on all code paths?

I suggest for the time being you use the pattern here (is this possible)?

Here is an example of how to make Expr::struct work in 37.1.0: #10183

For the next release (38.0.0) I was thinking we would move the create_physical_expr function into the core (rather than an example).

Thoughts?

I am not entirely sure; in delta-rs we just parse SQL strings into logical exprs and then convert them to physical exprs without doing any adjustments there.

I think it makes sense for these expr conversions to be done automatically in core, because logically nothing changed in the intent of the operation between v36 and v37; that the physical impl is different should be abstracted away.

@alamb
Contributor Author

alamb commented Apr 25, 2024

I am not entirely sure; in delta-rs we just parse SQL strings into logical exprs and then convert them to physical exprs without doing any adjustments there.

I think in 37.0.0 you'll need to update the code that converts to a physical expr to call the code in #10183

I think it makes sense that these expr conversions are automatically done in core.

Yes I agree. I will make a PR over the next few days with a proposed API

@alamb
Contributor Author

alamb commented May 1, 2024

Here is a PR with a proposed new API: #10330
