Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix](function) fix unexpected be core in string search function #31312

Merged
merged 4 commits into from
Feb 26, 2024

Conversation

xy720
Copy link
Member

@xy720 xy720 commented Feb 23, 2024

Proposed changes

Issue Number: close #31311

Keep same behavior with ck:
ck docs

Code: 43, Message: Illegal type String of argument of function multiMatchAny

Before:

CREATE TABLE `tbl2` (
  `ka` int(11) NULL,
  `kb` text NULL,
  `kc` array<text> NULL
) ENGINE=OLAP
UNIQUE KEY(`ka`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`ka`) BUCKETS 5
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"in_memory" = "false",
"storage_format" = "V2",
"disable_auto_compaction" = "false"
);
MySQL [test]> select multi_match_any(kb, 'hello') from tbl2 order by col1;
ERROR 1105 (HY000): RpcException, msg: org.apache.doris.rpc.RpcException: io.grpc.StatusRuntimeException: UNAVAILABLE: Network closed for unknown reason

After:

MySQL [test]> select multi_match_any(kb, 'hello') from tbl2 order by col1;
ERROR 1105 (HY000): errCode = 2, detailMessage = No matching function with signature: multi_match_any(text, varchar(-1)).

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

@xy720
Copy link
Member Author

xy720 commented Feb 23, 2024

run buildall

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 35.71% (8547/23936)
Line Coverage: 27.53% (69360/251944)
Region Coverage: 26.68% (35976/134843)
Branch Coverage: 23.48% (18388/78304)
Coverage Report: http://coverage.selectdb-in.cc/coverage/a357282735bd5e4a6cff5818bf4c3e1c1156bb54_a357282735bd5e4a6cff5818bf4c3e1c1156bb54/report/index.html

@xy720
Copy link
Member Author

xy720 commented Feb 23, 2024

run buildall

Copy link
Contributor

@cambyzju cambyzju left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Feb 23, 2024
Copy link
Contributor

PR approved by at least one committer and no changes requested.

Copy link
Contributor

PR approved by anyone and no changes requested.

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 35.71% (8548/23937)
Line Coverage: 27.52% (69361/251995)
Region Coverage: 26.68% (35971/134848)
Branch Coverage: 23.49% (18390/78304)
Coverage Report: http://coverage.selectdb-in.cc/coverage/acbf68f97b059210c9db3f42651a7fddb6b3a367_acbf68f97b059210c9db3f42651a7fddb6b3a367/report/index.html

@doris-robot
Copy link

TPC-H: Total hot run time: 40756 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit acbf68f97b059210c9db3f42651a7fddb6b3a367, data reload: false

------ Round 1 ----------------------------------
q1	17695	4942	4796	4796
q2	2052	138	138	138
q3	10576	982	980	980
q4	4646	949	952	949
q5	7627	3128	3142	3128
q6	193	130	134	130
q7	1220	761	756	756
q8	9236	2007	2043	2007
q9	7420	6533	6560	6533
q10	8289	2619	2628	2619
q11	415	203	223	203
q12	784	329	328	328
q13	17927	3643	3626	3626
q14	285	255	258	255
q15	611	532	496	496
q16	468	395	407	395
q17	906	856	858	856
q18	7417	6644	6546	6546
q19	1551	1476	1485	1476
q20	544	268	268	268
q21	7062	3954	3936	3936
q22	865	343	335	335
Total cold run time: 107789 ms
Total hot run time: 40756 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4774	4804	4786	4786
q2	292	179	178	178
q3	3572	3562	3573	3562
q4	2472	2473	2494	2473
q5	5721	5719	5687	5687
q6	212	128	129	128
q7	2221	1620	1654	1620
q8	2989	3061	3049	3049
q9	8642	8619	8626	8619
q10	6754	4229	4228	4228
q11	507	379	351	351
q12	769	529	531	529
q13	4481	3406	3390	3390
q14	274	228	227	227
q15	584	510	500	500
q16	482	410	437	410
q17	1605	1573	1565	1565
q18	8296	7619	7624	7619
q19	1631	1629	1619	1619
q20	2102	1804	1816	1804
q21	6459	6057	6076	6057
q22	570	526	519	519
Total cold run time: 65409 ms
Total hot run time: 58920 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 175568 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit acbf68f97b059210c9db3f42651a7fddb6b3a367, data reload: false

query1	918	352	342	342
query2	6520	1784	1756	1756
query3	6694	208	205	205
query4	23100	21175	21066	21066
query5	4312	374	376	374
query6	261	164	165	164
query7	4625	304	302	302
query8	244	194	193	193
query9	8429	2829	2818	2818
query10	421	227	230	227
query11	15131	14481	14425	14425
query12	140	88	80	80
query13	1699	427	426	426
query14	9244	7583	7631	7583
query15	214	183	195	183
query16	7643	261	250	250
query17	1401	550	523	523
query18	1943	267	257	257
query19	197	143	149	143
query20	84	81	82	81
query21	186	113	113	113
query22	4823	4766	4675	4675
query23	32355	31511	31426	31426
query24	12674	3332	3411	3332
query25	634	367	359	359
query26	1867	155	161	155
query27	3056	316	310	310
query28	6597	1835	1807	1807
query29	1096	612	612	612
query30	276	137	147	137
query31	940	741	754	741
query32	94	57	58	57
query33	723	238	238	238
query34	1082	484	492	484
query35	936	821	799	799
query36	1003	895	887	887
query37	183	60	65	60
query38	3284	3113	3193	3113
query39	1357	1328	1325	1325
query40	280	104	99	99
query41	37	35	34	34
query42	100	102	98	98
query43	472	451	461	451
query44	1087	705	700	700
query45	194	187	182	182
query46	1022	785	756	756
query47	1599	1577	1568	1568
query48	428	339	347	339
query49	1218	296	306	296
query50	758	372	367	367
query51	4535	4289	4268	4268
query52	108	95	97	95
query53	393	296	296	296
query54	292	232	233	232
query55	86	79	82	79
query56	225	210	195	195
query57	1093	936	954	936
query58	212	197	192	192
query59	2364	2318	2108	2108
query60	243	221	227	221
query61	81	82	81	81
query62	594	392	366	366
query63	319	284	298	284
query64	6204	3042	3102	3042
query65	3295	3266	3244	3244
query66	1334	335	323	323
query67	14311	14140	13792	13792
query68	5067	550	567	550
query69	519	364	346	346
query70	1288	1233	1238	1233
query71	452	253	260	253
query72	6313	2792	2568	2568
query73	690	316	316	316
query74	6865	6369	6477	6369
query75	3208	2563	2542	2542
query76	3266	1100	1186	1100
query77	344	240	239	239
query78	9425	8742	8657	8657
query79	968	499	493	493
query80	534	369	351	351
query81	441	204	202	202
query82	163	83	86	83
query83	135	122	120	120
query84	232	77	78	77
query85	1005	355	337	337
query86	292	320	308	308
query87	3485	3299	3285	3285
query88	2743	2288	2292	2288
query89	441	369	374	369
query90	1902	166	163	163
query91	153	126	125	125
query92	53	46	47	46
query93	982	520	498	498
query94	1093	174	174	174
query95	425	336	344	336
query96	574	262	264	262
query97	4396	4271	4262	4262
query98	217	203	199	199
query99	1065	745	759	745
Total cold run time: 268006 ms
Total hot run time: 175568 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 31.35 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit acbf68f97b059210c9db3f42651a7fddb6b3a367, data reload: false

query1	0.03	0.02	0.02
query2	0.06	0.02	0.02
query3	0.24	0.08	0.07
query4	1.65	0.08	0.08
query5	0.48	0.48	0.48
query6	1.37	0.61	0.61
query7	0.02	0.01	0.01
query8	0.04	0.02	0.03
query9	0.53	0.44	0.44
query10	0.47	0.48	0.48
query11	0.12	0.10	0.09
query12	0.11	0.10	0.10
query13	0.58	0.59	0.58
query14	0.76	0.78	0.80
query15	0.82	0.78	0.79
query16	0.32	0.33	0.33
query17	0.85	0.90	0.90
query18	0.17	0.17	0.18
query19	1.79	1.70	1.60
query20	0.02	0.01	0.01
query21	15.41	0.68	0.60
query22	3.13	5.04	2.88
query23	17.73	1.01	0.96
query24	1.96	0.41	0.37
query25	0.63	0.06	0.06
query26	0.16	0.14	0.13
query27	0.05	0.06	0.05
query28	12.21	0.81	0.81
query29	12.50	3.42	3.35
query30	0.55	0.57	0.47
query31	2.78	0.35	0.37
query32	3.36	0.47	0.47
query33	3.14	3.13	3.11
query34	15.35	4.51	4.49
query35	4.50	4.48	4.47
query36	1.07	0.93	0.95
query37	0.07	0.05	0.05
query38	0.03	0.03	0.03
query39	0.02	0.01	0.02
query40	0.16	0.14	0.18
query41	0.07	0.02	0.01
query42	0.02	0.02	0.02
query43	0.03	0.02	0.02
Total cold run time: 105.36 s
Total hot run time: 31.35 s

@doris-robot
Copy link

Load test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'

Load test result on commit acbf68f97b059210c9db3f42651a7fddb6b3a367 with default session variables
Stream load json:         19 seconds loaded 2358488459 Bytes, about 118 MB/s
Stream load orc:          59 seconds loaded 1101869774 Bytes, about 17 MB/s
Stream load parquet:      32 seconds loaded 861443392 Bytes, about 25 MB/s
Insert into select:       16.4 seconds inserted 10000000 Rows, about 609K ops/s

@@ -495,6 +495,12 @@ public static boolean isCastMatchAllowed(Function desc, Function candicate) {
return false;
}
}
if (FunctionCallExpr.STRING_SEARCH_FUNCTION_SET.contains(desc.functionName())) {
if (descArgTypes[1].isStringType() && candicateArgTypes[1].isArrayType()) {
// The needles arg of search functions should not be allowed to cast from string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this is not allowed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhiqiang-hhhh That's correct. see testcase.

Copy link
Member Author

@xy720 xy720 Feb 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to prohibit casting the needles argument from string -> array<text>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why cast sting to array leading to core?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the cast function wrapped a nullable layer above the result type when it encountered a string input.

Copy link
Contributor

@zclllyybb zclllyybb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@zclllyybb zclllyybb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need more discuss. hold on now. we may need a complete fix on planner to prohibit string to array<string>. @morrySnow PTAL

Copy link
Contributor

@zhiqiang-hhhh zhiqiang-hhhh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to throw exception when we meeting NULL as argument ?

@xy720
Copy link
Member Author

xy720 commented Feb 26, 2024

Do we need to throw exception when we meeting NULL as argument ?

For this type of case, CK will return nil instead of throwing an error, and Doris are currently returning 0. Do we need to maintain consistency with CK, as this function was migrated from CK?

@zhiqiang-hhhh
Copy link
Contributor

Signature of this function in doris suggests that the result is always not nullable, if we changed that, there will be many changes on BE, for example maybe we should set use_default_impl_for_nulls to true.

I think its ok to keep current behavior.

@xy720

@xy720
Copy link
Member Author

xy720 commented Feb 26, 2024

need more discuss. hold on now. we may need a complete fix on planner to prohibit string to array<string>. @morrySnow PTAL

@zclllyybb @morrySnow Do we still need to discuss it? Perhaps you can provide me with an example so that I can fix it in the next PR.

@zclllyybb
Copy link
Contributor

need more discuss. hold on now. we may need a complete fix on planner to prohibit string to array<string>. @morrySnow PTAL

@zclllyybb @morrySnow Do we still need to discuss it? Perhaps you can provide me with an example so that I can fix it in the next PR.

I think the better way is totally forbid implicit casting from string to array<string> when do type coercion for function signature matches, rather than use a special set STRING_SEARCH_FUNCTION_SET to avoid this. could you please handle this? or you can just fix the origin problem and leave it as a individual task

@xy720
Copy link
Member Author

xy720 commented Feb 26, 2024

need more discuss. hold on now. we may need a complete fix on planner to prohibit string to array<string>. @morrySnow PTAL

@zclllyybb @morrySnow Do we still need to discuss it? Perhaps you can provide me with an example so that I can fix it in the next PR.

I think the better way is totally forbid implicit casting from string to array<string> when do type coercion for function signature matches, rather than use a special set STRING_SEARCH_FUNCTION_SET to avoid this. could you please handle this? or you can just fix the origin problem and leave it as a individual task

Completely forbid implicit casting may affect many behaviors. For example, should we ban the following case?

case: implicit casting in insert stmt

MySQL [test]> desc tbl2;
+-------+-------------+------+-------+---------+---------+
| Field | Type        | Null | Key   | Default | Extra   |
+-------+-------------+------+-------+---------+---------+
| ka    | INT         | Yes  | true  | NULL    |         |
| kb    | TEXT        | Yes  | false | NULL    | REPLACE |
| kc    | ARRAY<TEXT> | Yes  | false | NULL    | REPLACE |
+-------+-------------+------+-------+---------+---------+
3 rows in set (0.01 sec)

MySQL [test]> insert into tbl2 values(1, '123', '[123,123]');
Query OK, 1 row affected (0.07 sec)
{'label':'insert_6e5f2205983f49a7_8767898ea1995eb2', 'status':'VISIBLE', 'txnId':'12019'}

And I think implicitly casting string to other type is a tradition in doris (ex. string->(int, date,jsonb)), but unfortunately not in Clickhouse, if we want to change it, we really need more discuss about it, may be I should leave it as an individual task.

@xy720 xy720 merged commit 87201c0 into apache:master Feb 26, 2024
28 of 32 checks passed
@zclllyybb
Copy link
Contributor

need more discuss. hold on now. we may need a complete fix on planner to prohibit string to array<string>. @morrySnow PTAL

@zclllyybb @morrySnow Do we still need to discuss it? Perhaps you can provide me with an example so that I can fix it in the next PR.

I think the better way is totally forbid implicit casting from string to array<string> when do type coercion for function signature matches, rather than use a special set STRING_SEARCH_FUNCTION_SET to avoid this. could you please handle this? or you can just fix the origin problem and leave it as a individual task

Completely forbid implicit casting may affect many behaviors. For example, should we ban the following case?

case: implicit casting in insert stmt

MySQL [test]> desc tbl2;
+-------+-------------+------+-------+---------+---------+
| Field | Type        | Null | Key   | Default | Extra   |
+-------+-------------+------+-------+---------+---------+
| ka    | INT         | Yes  | true  | NULL    |         |
| kb    | TEXT        | Yes  | false | NULL    | REPLACE |
| kc    | ARRAY<TEXT> | Yes  | false | NULL    | REPLACE |
+-------+-------------+------+-------+---------+---------+
3 rows in set (0.01 sec)

MySQL [test]> insert into tbl2 values(1, '123', '[123,123]');
Query OK, 1 row affected (0.07 sec)
{'label':'insert_6e5f2205983f49a7_8767898ea1995eb2', 'status':'VISIBLE', 'txnId':'12019'}

And I think implicitly casting string to other type is a tradition in doris (ex. string->(int, date,jsonb)), but unfortunately not in Clickhouse, if we want to change it, we really need more discuss about it, may be I should leave it as an individual task.

  1. cast string to another type is acceptable. but to array<string>(generally, to all type which have a subtype of string) is a special case. the users may not sure what they are doing, like '[1,2,3]' means ['1','2','3'] or ['[1,2,3]']? for this particular scenario, maybe forbid to implicit cast is a good decision. If they need, could use a more clear way or explicit cast.
  2. for functions' match, we can individually deal some of the type coercion rules. so there still need more discussion about if we should change the scenario you mentioned together.

looking forward for more opinion!

@xy720
Copy link
Member Author

xy720 commented Feb 26, 2024

need more discuss. hold on now. we may need a complete fix on planner to prohibit string to array<string>. @morrySnow PTAL

@zclllyybb @morrySnow Do we still need to discuss it? Perhaps you can provide me with an example so that I can fix it in the next PR.

I think the better way is totally forbid implicit casting from string to array<string> when do type coercion for function signature matches, rather than use a special set STRING_SEARCH_FUNCTION_SET to avoid this. could you please handle this? or you can just fix the origin problem and leave it as a individual task

Completely forbid implicit casting may affect many behaviors. For example, should we ban the following case?
case: implicit casting in insert stmt

MySQL [test]> desc tbl2;
+-------+-------------+------+-------+---------+---------+
| Field | Type        | Null | Key   | Default | Extra   |
+-------+-------------+------+-------+---------+---------+
| ka    | INT         | Yes  | true  | NULL    |         |
| kb    | TEXT        | Yes  | false | NULL    | REPLACE |
| kc    | ARRAY<TEXT> | Yes  | false | NULL    | REPLACE |
+-------+-------------+------+-------+---------+---------+
3 rows in set (0.01 sec)

MySQL [test]> insert into tbl2 values(1, '123', '[123,123]');
Query OK, 1 row affected (0.07 sec)
{'label':'insert_6e5f2205983f49a7_8767898ea1995eb2', 'status':'VISIBLE', 'txnId':'12019'}

And I think implicitly casting string to other type is a tradition in doris (ex. string->(int, date,jsonb)), but unfortunately not in Clickhouse, if we want to change it, we really need more discuss about it, may be I should leave it as an individual task.

  1. cast string to another type is acceptable. but to array<string>(generally, to all type which have a subtype of string) is a special case. the users may not sure what they are doing, like '[1,2,3]' means ['1','2','3'] or ['[1,2,3]']? for this particular scenario, maybe forbid to implicit cast is a good decision. If they need, could use a more clear way or explicit cast.
  2. for functions' match, we can individually deal some of the type coercion rules. so there still need more discussion about if we should change the scenario you mentioned together.

looking forward for more opinion!

Got it. I will try to forbid the case 1 later, that doesn't require a lot of work. For case 2, we can leave it as an individual task to be discussed.

@morrySnow
Copy link
Contributor

morrySnow commented Feb 26, 2024

@zclllyybb @xy720 . Cast string to array is ok. We parse it with json format. string '[1,2,3]' cast to array<string> will produce text array with 3 elements: '1', '2' and '3'. It is not ambiguous. All strings that do not meet the required format will be converted to null.
@amorynan , could u provide more details?

yiguolei pushed a commit that referenced this pull request Feb 26, 2024
)

Fix be core in multi_match_any/multi_search_all_positions functions.
yiguolei pushed a commit that referenced this pull request Feb 26, 2024
)

Fix be core in multi_match_any/multi_search_all_positions functions.
yiguolei pushed a commit that referenced this pull request Feb 27, 2024
)

Fix be core in multi_match_any/multi_search_all_positions functions.
xy720 added a commit that referenced this pull request Feb 27, 2024
@xy720
Copy link
Member Author

xy720 commented Feb 27, 2024

@zclllyybb @xy720 . Cast string to array is ok. We parse it with json format. string '[1,2,3]' cast to array<string> will produce text array with 3 elements: '1', '2' and '3'. It is not ambiguous. All strings that do not meet the required format will be converted to null.
@amorynan , could u provide more details?

@amorynan, @morrySnow Any more details? Do we already support casting string to array with json format now ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by one committer. dev/2.0.6-merged p0_c reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[bug] unexpected be core in string search function
8 participants