Unable to perform lead/lag built in functions on List and Struct data types #10328

Closed
timsaucer opened this issue May 1, 2024 · 1 comment · Fixed by #10329
Labels
bug Something isn't working

Comments

@timsaucer
Contributor

Describe the bug

When you use the lead or lag built-in window functions on a column whose data type is either a list or a struct, the query fails with the error: Exception: Arrow error: Compute error: concat requires input of at least one array

I traced the root cause to list_to_array_of_size in datafusion/common/src/scalar/mod.rs, which does not check whether the set of arrays it is about to concat is empty. It is empty here because WindowAggState::new() calls to_array_of_size(0). That call works for primitive data, but list data needs an additional check. I am submitting a PR to resolve the issue.
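
To make the failure mode concrete, here is a toy model in plain Rust. It does not use DataFusion or Arrow: toy_concat, repeat_unguarded, and repeat_guarded are hypothetical stand-ins, and the early return is only one possible form of the "additional check" described above. The actual fix in the PR may look different.

```rust
// Toy model only: these are NOT DataFusion or Arrow types or functions.
#[derive(Debug, PartialEq)]
enum ToyError {
    EmptyConcat,
}

/// Stand-in for a concat kernel that rejects an empty list of inputs.
fn toy_concat(arrays: &[Vec<i64>]) -> Result<Vec<i64>, ToyError> {
    if arrays.is_empty() {
        return Err(ToyError::EmptyConcat);
    }
    Ok(arrays.iter().flatten().copied().collect())
}

/// Repeats `value` `size` times by concatenating copies.
/// With `size == 0` there is nothing to concat, so this returns an error.
fn repeat_unguarded(value: &[i64], size: usize) -> Result<Vec<i64>, ToyError> {
    let copies: Vec<Vec<i64>> = std::iter::repeat(value.to_vec()).take(size).collect();
    toy_concat(&copies)
}

/// Same as above, but returns an empty result early instead of calling
/// concat with zero inputs (assumed shape of the guard).
fn repeat_guarded(value: &[i64], size: usize) -> Result<Vec<i64>, ToyError> {
    if size == 0 {
        return Ok(Vec::new());
    }
    repeat_unguarded(value, size)
}
```

In this model, repeat_unguarded(&[1, 2, 3], 0) returns the error, while repeat_guarded(&[1, 2, 3], 0) returns an empty vector. That mirrors why a zero-sized request has to be handled before the concat step.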

To Reproduce

Data file is a simple csv:

a,b,c
1,2,3
4,5,6
7,8,9
10,11,12

Code to reproduce:

use datafusion::{
    logical_expr::{
        expr::WindowFunction, BuiltInWindowFunction, WindowFrame, WindowFunctionDefinition,
    },
    prelude::*,
};

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {

    let ctx = SessionContext::new();
    let mut df = ctx.read_csv("/Users/tsaucer/working/testing_ballista/lead_lag/example.csv", CsvReadOptions::default()).await?;

    df = df.with_column("array_col", make_array(vec![col("a"), col("b"), col("c")]))?;

    df.clone().show().await?;

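    // Note: despite the variable name, this builds a `lead` expression.
    let lag_expr = Expr::WindowFunction(WindowFunction::new(
        WindowFunctionDefinition::BuiltInWindowFunction(
            BuiltInWindowFunction::Lead,
        ),
        vec![col("array_col")], // function arguments
        vec![],                 // PARTITION BY
        vec![],                 // ORDER BY
        WindowFrame::new(None),
        None,                   // null treatment
    ));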

    df = df.select(vec![col("a"), col("b"), col("c"), col("array_col"), lag_expr.alias("lagged")])?;

    df.show().await?;

    Ok(())
}

Results:

+----+----+----+--------------+
| a  | b  | c  | array_col    |
+----+----+----+--------------+
| 1  | 2  | 3  | [1, 2, 3]    |
| 4  | 5  | 6  | [4, 5, 6]    |
| 7  | 8  | 9  | [7, 8, 9]    |
| 10 | 11 | 12 | [10, 11, 12] |
+----+----+----+--------------+
Error: ArrowError(ComputeError("concat requires input of at least one array"), None)

Expected behavior

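I expect lead and lag to work on these data types. Here is the output with the PR I will put up shortly. Note that the reproduction uses lead, so the column aliased as lagged holds the next row's value.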

+----+----+----+--------------+
| a  | b  | c  | array_col    |
+----+----+----+--------------+
| 1  | 2  | 3  | [1, 2, 3]    |
| 4  | 5  | 6  | [4, 5, 6]    |
| 7  | 8  | 9  | [7, 8, 9]    |
| 10 | 11 | 12 | [10, 11, 12] |
+----+----+----+--------------+
+----+----+----+--------------+--------------+
| a  | b  | c  | array_col    | lagged       |
+----+----+----+--------------+--------------+
| 1  | 2  | 3  | [1, 2, 3]    | [4, 5, 6]    |
| 4  | 5  | 6  | [4, 5, 6]    | [7, 8, 9]    |
| 7  | 8  | 9  | [7, 8, 9]    | [10, 11, 12] |
| 10 | 11 | 12 | [10, 11, 12] |              |
+----+----+----+--------------+--------------+
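
The expected lagged column above can be checked against a small plain-Rust model of lead with an offset of 1 and no partitioning or ordering. lead_lists is a hypothetical helper written for illustration. It is not part of DataFusion, and None stands in for the empty (null) cell in the last row.

```rust
/// Plain-Rust model of `lead` over list values (hypothetical helper).
/// Row `i` receives the list from row `i + offset`, or `None` past the end.
fn lead_lists(rows: &[Vec<i64>], offset: usize) -> Vec<Option<Vec<i64>>> {
    (0..rows.len())
        .map(|i| rows.get(i + offset).cloned())
        .collect()
}

/// The `array_col` values from the example CSV, one list per row.
fn example_rows() -> Vec<Vec<i64>> {
    vec![
        vec![1, 2, 3],
        vec![4, 5, 6],
        vec![7, 8, 9],
        vec![10, 11, 12],
    ]
}
```

Calling lead_lists(&example_rows(), 1) gives the lists for rows 2 to 4 followed by None, which matches the lagged column in the table.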

Additional context

This is the root cause for apache/datafusion-python#647

@alamb
Contributor

alamb commented May 2, 2024

I saw this example and it reminded me how hard it is to create window functions with the expr api -- it would be great to make this better. See #6747
