fix(query): StringSearchLike vector_vector can not match '\n' #8359

TCeason (Collaborator) commented Oct 20, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

Before this PR, the query "select 'hello\n' like 'h%';" returned false.

The cause is the Rust regex library: . matches any valid UTF-8 encoded Unicode scalar value except \n. To also match \n, enable the s flag, e.g. (?s:.).

See #8354 for more details.

Fixes #8354
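
As an illustration of the idea behind the fix, here is a minimal sketch of translating a LIKE pattern into a regex string in which the wildcards also match \n. It is not necessarily the code changed in this PR: the helper names like_to_regex and push_literal are hypothetical, and treating backslash as the LIKE escape character is an assumption.

```rust
/// Hypothetical helper (not from the PR): translate a SQL LIKE pattern
/// into an anchored regex string whose wildcards also match '\n'.
fn like_to_regex(like: &str) -> String {
    let mut out = String::from("^");
    let mut chars = like.chars();
    while let Some(c) = chars.next() {
        match c {
            // '%' matches any sequence of characters, newlines included.
            '%' => out.push_str("(?s:.)*"),
            // '_' matches exactly one character, newlines included.
            '_' => out.push_str("(?s:.)"),
            // Assumption: backslash escapes the next LIKE character.
            '\\' => match chars.next() {
                Some(next) => push_literal(&mut out, next),
                None => push_literal(&mut out, '\\'),
            },
            other => push_literal(&mut out, other),
        }
    }
    out.push('$');
    out
}

/// Append `c` to `out`, escaping it if it is a regex metacharacter.
fn push_literal(out: &mut String, c: char) {
    let is_meta = matches!(
        c,
        '\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']' | '{' | '}' | '^' | '$'
    );
    if is_meta {
        out.push('\\');
    }
    out.push(c);
}
```

With this sketch, the pattern 'h%' becomes ^h(?s:.)*$, the same form used in the REGEXP queries later in this conversation. A plain . in place of (?s:.) would stop at the first \n.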

@TCeason TCeason requested a review from sundy-li October 20, 2022 09:44
@mergify mergify bot added the pr-bugfix this PR patches a bug in codebase label Oct 20, 2022
TCeason (Collaborator, Author) commented Oct 20, 2022

REGEXP has the same problem, because we pass the user's pattern to the regex builder unchanged.

#[inline]
fn build_regexp_from_pattern(pat: &[u8]) -> Result<BytesRegex> {
    let pattern = match pat.is_empty() {
        true => "^$",
        false => simdutf8::basic::from_utf8(pat).map_err(|e| {
            ErrorCode::BadArguments(format!(
                "Unable to convert the REGEXP pattern to string: {}",
                e
            ))
        })?,
    };

    BytesRegexBuilder::new(pattern)
        .case_insensitive(true)
        .build()
        .map_err(|e| {
            ErrorCode::BadArguments(format!("Unable to build regex from REGEXP pattern: {}", e))
        })
}

Maybe this behavior is too Rust-specific?

'root'@mysqldb 17:46:00 [(none)]> select 'hello\n' regexp '^h.*$';
+-------------------------+
| 'hello
' regexp '^h.*$' |
+-------------------------+
|                       0 |
+-------------------------+
1 row in set (0.07 sec)
Read 1 rows, 1.00 B in 0.005 sec., 187.23 rows/sec., 187.23 B/sec.

'root'@mysqldb 17:46:25 [(none)]> select 'hello\n' regexp '^h(?s:.)*$';
+------------------------------+
| 'hello
' regexp '^h(?s:.)*$' |
+------------------------------+
|                            1 |
+------------------------------+
1 row in set (0.05 sec)
Read 1 rows, 1.00 B in 0.005 sec., 186.81 rows/sec., 186.81 B/sec.

'root'@mysqldb 17:46:32 [(none)]> create table t(id String);
Query OK, 0 rows affected (0.04 sec)

'root'@mysqldb 17:48:35 [(none)]> insert into t values('hello\n'), ('h\n');
Query OK, 0 rows affected (0.10 sec)

'root'@mysqldb 17:48:57 [(none)]> select * from t where id regexp '^h.*$';
Empty set (0.09 sec)
Read 2 rows, 32.00 B in 0.026 sec., 76.12 rows/sec., 1.19 KiB/sec.

'root'@mysqldb 17:49:24 [(none)]> select * from t where id regexp '^h(?s:.)*$';
+--------+
| id     |
+--------+
| hello
 |
| h
     |
+--------+
2 rows in set (0.08 sec)
Read 2 rows, 32.00 B in 0.020 sec., 99.27 rows/sec., 1.55 KiB/sec.

'root'@mysqldb 17:49:37 [(none)]> select * from t where id regexp '^h(?s:.)$';
+------+
| id   |
+------+
| h
   |
+------+
1 row in set (0.07 sec)
Read 2 rows, 32.00 B in 0.035 sec., 57.76 rows/sec., 924.14 B/sec.
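
The queries above rewrite each . by hand as (?s:.). If REGEXP were changed so that . always matches \n, one option would be to prefix the whole pattern with the (?s) flag before building it. The sketch below shows only that string step. The helper name with_dot_matches_newline is hypothetical, and nothing in this thread says the change was made. The regex crate's builders also expose a dot_matches_new_line option that has the same effect without touching the pattern text.

```rust
/// Hypothetical helper (not from the PR): make every '.' in a REGEXP
/// pattern match '\n' by enabling the `s` flag for the whole pattern.
fn with_dot_matches_newline(pattern: &str) -> String {
    if pattern.is_empty() {
        // Mirror build_regexp_from_pattern: an empty pattern
        // matches only the empty string.
        "^$".to_string()
    } else {
        // A leading (?s) applies the flag to everything that follows.
        format!("(?s){}", pattern)
    }
}
```

Under this sketch, '^h.*$' would be built as (?s)^h.*$, which should behave like the hand-written '^h(?s:.)*$' in the transcript above.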

@sundy-li sundy-li requested a review from FANNG1 October 20, 2022 11:50
@BohuTANG BohuTANG merged commit 04ac812 into datafuselabs:main Oct 20, 2022
Successfully merging this pull request may close these issues.

bug: like match failed if column content has '\n'
3 participants