@jackwener
Member

Which issue does this PR close?

Closes #5893.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Apr 6, 2023
@jackwener jackwener changed the title fix: like and ilike not supported for LargeUtf8 fix: binaryExpr not supported for LargeUtf8 Apr 6, 2023
@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Apr 6, 2023
# like and ilike for LargeUtf8
# issue: https://github.com/apache/arrow-datafusion/issues/5893
query B
select arrow_cast('foo', 'LargeUtf8') like '%foo%';
Contributor

Cool. By the way, why is it like '%foo%' and not == 'foo'?

Member Author

It's a case of issue

Contributor

It's a case of issue

Sorry, I still don't get it. Do you mean the inner value of Utf8("foo") is not the same as the inner value of LargeUtf8("foo")?

Member Author

@jackwener jackwener Apr 6, 2023

Sorry, that wasn't clear. 😂
This test covers the issue in #5893.
English isn't my native language, so I subconsciously assumed "issue" would be read as "GitHub issue".
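
On the question of whether the inner values differ: in the Arrow format, Utf8 and LargeUtf8 hold the same string bytes and differ in the width of their offsets (32-bit versus 64-bit), which is why code that dispatches on the data type has to handle both. The sketch below models that with hypothetical types (`Offset`, `StringColumn`, and the two aliases are made up for illustration and are not the arrow-rs API):

```rust
/// Hypothetical offset abstraction: i32 for Utf8, i64 for LargeUtf8.
trait Offset: Copy {
    fn from_usize(n: usize) -> Self;
    fn to_usize(self) -> usize;
}

impl Offset for i32 {
    fn from_usize(n: usize) -> Self {
        i32::try_from(n).expect("offset overflows i32")
    }
    fn to_usize(self) -> usize {
        self as usize
    }
}

impl Offset for i64 {
    fn from_usize(n: usize) -> Self {
        i64::try_from(n).expect("offset overflows i64")
    }
    fn to_usize(self) -> usize {
        self as usize
    }
}

/// Simplified variable-length string column: one contiguous byte buffer
/// plus `len + 1` offsets into it. Nulls are ignored in this sketch.
struct StringColumn<O: Offset> {
    offsets: Vec<O>,
    values: Vec<u8>,
}

impl<O: Offset> StringColumn<O> {
    fn from_strs(items: &[&str]) -> Self {
        let mut offsets = vec![O::from_usize(0)];
        let mut values = Vec::new();
        for s in items {
            values.extend_from_slice(s.as_bytes());
            offsets.push(O::from_usize(values.len()));
        }
        Self { offsets, values }
    }

    /// Slice the shared byte buffer between two consecutive offsets.
    fn value(&self, i: usize) -> &str {
        let start = self.offsets[i].to_usize();
        let end = self.offsets[i + 1].to_usize();
        std::str::from_utf8(&self.values[start..end]).expect("valid UTF-8")
    }
}

type Utf8Column = StringColumn<i32>;
type LargeUtf8Column = StringColumn<i64>;
```

Building both aliases from `&["foo", "bar"]` yields identical `values` buffers and identical strings from `value(i)`; only the element size of `offsets` differs (4 bytes versus 8).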

@SimonSchneider SimonSchneider Apr 6, 2023

I guess a more comprehensive test would be

select arrow_cast('foobar', 'LargeUtf8') like '%oba%';
----
true

select arrow_cast('FooBar', 'LargeUtf8') ilike '%oba%';
----
true

https://www.postgresql.org/docs/7.3/functions-matching.html#:~:text=The%20keyword%20ILIKE%20can%20be,and%20~~*%20corresponds%20to%20ILIKE%20.
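
To make the expected results above easier to check by hand, here is a minimal sketch of LIKE and ILIKE semantics in plain Rust. It is an illustrative model only, not DataFusion's or Arrow's kernel: the function names `like` and `ilike` are made up here, escape characters are not handled, and ILIKE is approximated by lowercasing both sides.

```rust
/// Returns true if `value` matches the SQL LIKE `pattern`.
/// `%` matches any run of characters (including none); `_` matches exactly one.
/// Assumption: no escape character support in this sketch.
fn like(value: &str, pattern: &str) -> bool {
    let v: Vec<char> = value.chars().collect();
    let p: Vec<char> = pattern.chars().collect();
    let (mut vi, mut pi) = (0usize, 0usize);
    // Position of the most recent `%` and the value index it was tried at,
    // so the match can backtrack and let `%` absorb more characters.
    let mut star: Option<(usize, usize)> = None;

    while vi < v.len() {
        if pi < p.len() && p[pi] == '%' {
            star = Some((pi, vi));
            pi += 1;
        } else if pi < p.len() && (p[pi] == '_' || p[pi] == v[vi]) {
            vi += 1;
            pi += 1;
        } else if let Some((sp, sv)) = star {
            // Mismatch: let the last `%` consume one more character and retry.
            pi = sp + 1;
            vi = sv + 1;
            star = Some((sp, sv + 1));
        } else {
            return false;
        }
    }
    // The value is exhausted; only trailing `%` may remain in the pattern.
    p[pi..].iter().all(|&c| c == '%')
}

/// Case-insensitive variant, approximated by lowercasing both operands.
fn ilike(value: &str, pattern: &str) -> bool {
    like(&value.to_lowercase(), &pattern.to_lowercase())
}
```

Under this model `like("foobar", "%oba%")` and `ilike("FooBar", "%oba%")` are both true, while `like("FooBar", "%oba%")` is false because LIKE is case-sensitive.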

Contributor

Another test that would be great to add would be to compare columns as well as constants (as that is the other binary expr codepath)

Something like

create table t as values
  (arrow_cast('Foo', 'LargeUtf8'), '%f'),
  (arrow_cast('Bar', 'LargeUtf8'), 'B%');

select column1 like column2 from t;
select column1 ilike column2 from t;
select column1 not like column2 from t;
select column1 not ilike column2 from t;
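
The reason column operands deserve their own test is that a binary expression can receive either a constant or a whole column on each side, and both paths should give the same answer. A rough sketch of that idea, using a hypothetical `Operand` type rather than DataFusion's real internals, and a deliberately restricted `simple_like` that only understands a single leading or trailing `%`:

```rust
/// Hypothetical stand-in for an expression operand: either one constant
/// or a column of nullable strings.
enum Operand {
    Scalar(Option<String>),
    Column(Vec<Option<String>>),
}

impl Operand {
    /// Value seen at `row`; a scalar is the same for every row.
    fn get(&self, row: usize) -> Option<&str> {
        match self {
            Operand::Scalar(s) => s.as_deref(),
            Operand::Column(c) => c[row].as_deref(),
        }
    }
}

/// Restricted LIKE for illustration: handles `%suffix`, `prefix%`,
/// or an exact match. It is not a full LIKE implementation.
fn simple_like(value: &str, pattern: &str) -> bool {
    if let Some(suffix) = pattern.strip_prefix('%') {
        value.ends_with(suffix)
    } else if let Some(prefix) = pattern.strip_suffix('%') {
        value.starts_with(prefix)
    } else {
        value == pattern
    }
}

/// Evaluate `lhs LIKE rhs` row by row; a NULL on either side yields NULL.
fn eval_like(lhs: &Operand, rhs: &Operand, rows: usize) -> Vec<Option<bool>> {
    (0..rows)
        .map(|i| match (lhs.get(i), rhs.get(i)) {
            (Some(v), Some(p)) => Some(simple_like(v, p)),
            _ => None,
        })
        .collect()
}
```

With the two rows from the table above, column-against-column evaluation gives `[Some(false), Some(true)]` ('Foo' does not end in a lowercase 'f'; 'Bar' starts with 'B'), and comparing the same column against the scalar pattern 'B%' gives the same result.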

Contributor

@alamb alamb left a comment

Thank you for working on this @jackwener

@mingmwang
Contributor

I remember there are some other issues with LargeUtf8; for example, sorting on it does not work.
Maybe we need a separate unit test suite that covers all the different Exprs with LargeUtf8.

@mingmwang
Contributor

#[tokio::test]
async fn sort_on_window_null_string() -> Result<()> {
    let d1: DictionaryArray<Int32Type> =
        vec![Some("one"), None, Some("three")].into_iter().collect();
    let d2: StringArray = vec![Some("ONE"), None, Some("THREE")].into_iter().collect();
    let d3: LargeStringArray =
        vec![Some("One"), None, Some("Three")].into_iter().collect();

    let batch = RecordBatch::try_from_iter(vec![
        ("d1", Arc::new(d1) as ArrayRef),
        ("d2", Arc::new(d2) as ArrayRef),
        ("d3", Arc::new(d3) as ArrayRef),
    ])
    .unwrap();

    let ctx = SessionContext::with_config(SessionConfig::new().with_target_partitions(1));
    ctx.register_batch("test", batch)?;

    let sql =
        "SELECT d1, row_number() OVER (partition by d1) as rn1 FROM test order by d1 asc";

    let actual = execute_to_batches(&ctx, sql).await;
    // NULLS LAST
    let expected = vec![
        "+-------+-----+",
        "| d1    | rn1 |",
        "+-------+-----+",
        "| one   | 1   |",
        "| three | 1   |",
        "|       | 1   |",
        "+-------+-----+",
    ];
    assert_batches_eq!(expected, &actual);

    let sql =
        "SELECT d2, row_number() OVER (partition by d2) as rn1 FROM test ORDER BY d2 asc";
    let actual = execute_to_batches(&ctx, sql).await;
    // NULLS LAST
    let expected = vec![
        "+-------+-----+",
        "| d2    | rn1 |",
        "+-------+-----+",
        "| ONE   | 1   |",
        "| THREE | 1   |",
        "|       | 1   |",
        "+-------+-----+",
    ];
    assert_batches_eq!(expected, &actual);

    let sql =
        "SELECT d2, row_number() OVER (partition by d2 order by d2 desc) as rn1 FROM test ORDER BY d2 desc";

    let actual = execute_to_batches(&ctx, sql).await;
    // NULLS FIRST
    let expected = vec![
        "+-------+-----+",
        "| d2    | rn1 |",
        "+-------+-----+",
        "|       | 1   |",
        "| THREE | 1   |",
        "| ONE   | 1   |",
        "+-------+-----+",
    ];
    assert_batches_eq!(expected, &actual);

    // FIXME sort on LargeUtf8 String has bug.
    // let sql =
    //     "SELECT d3, row_number() OVER (partition by d3) as rn1 FROM test";
    // let actual = execute_to_batches(&ctx, sql).await;
    // let expected = vec![
    //     "+-------+-----+",
    //     "| d3    | rn1 |",
    //     "+-------+-----+",
    //     "|       | 1   |",
    //     "| One   | 1   |",
    //     "| Three | 1   |",
    //     "+-------+-----+",
    // ];
    // assert_batches_eq!(expected, &actual);

    Ok(())
}

@jackwener
Member Author

jackwener commented Apr 8, 2023

@mingmwang Can we add it in a follow-up PR?

It doesn't look related to this PR.

@jackwener jackwener requested a review from alamb April 9, 2023 13:01
@jackwener
Member Author

If there are no new reviews, I plan to merge this PR tomorrow, since it has already been reviewed.

@jackwener jackwener merged commit d9cffa6 into apache:main Apr 10, 2023
@jackwener jackwener deleted the fix_large_utf8 branch April 10, 2023 14:39
@alamb
Contributor

alamb commented Apr 10, 2023

I agree adding additional tests for unrelated functionality (e.g. general support for LargeUtf8) would be best done in some other PR

korowa pushed a commit to korowa/arrow-datafusion that referenced this pull request Apr 13, 2023
* fix: like and ilike not supported for LargeUtf8

* add test
