[fix](csv reader) fix incorrect column parsing when using enclose for CSV files with UTF-8 BOM #60864

Open
sollhui wants to merge 1 commit into apache:master from
sollhui:fix_parse_utf-8_bom

Conversation

@sollhui
Contributor

@sollhui sollhui commented Feb 27, 2026

Background

When reading CSV files with a UTF-8 BOM (byte order mark) and an enclose character enabled
(e.g., enclose = '"'), column names and data values are parsed incorrectly.

Root Cause

In enclose mode, EncloseCsvLineReaderCtx pre-computes column_sep_positions (absolute
byte offsets of column separators) during read_line(). These positions are calculated on
the raw line data including the 3-byte BOM (0xEF 0xBB 0xBF).

Later, CsvReader::_remove_bom() shifts the data pointer forward by 3 bytes, but the
pre-computed column_sep_positions are not adjusted accordingly. When
EncloseCsvTextFieldSplitter::do_split() uses these stale positions on the shifted pointer,
all field boundaries are off by 3 bytes, resulting in corrupted column names and data.
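
The off-by-3 effect can be reproduced in isolation. The following is a minimal sketch, not the Doris code: `find_separators` and `split_at` are hypothetical helpers, enclose handling is omitted for brevity, and the separator is assumed to be a comma.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: record the byte offset of every ',' in the line.
// Offsets are relative to the start of the buffer that was scanned.
std::vector<std::size_t> find_separators(const std::string& line) {
    std::vector<std::size_t> positions;
    for (std::size_t i = 0; i < line.size(); ++i) {
        if (line[i] == ',') {
            positions.push_back(i);
        }
    }
    return positions;
}

// Hypothetical helper: cut `data` at pre-computed separator offsets.
// It trusts the offsets and never looks at the bytes themselves.
std::vector<std::string> split_at(const char* data, std::size_t len,
                                  const std::vector<std::size_t>& sep_positions) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    for (std::size_t pos : sep_positions) {
        assert(pos >= start && pos < len);
        fields.emplace_back(data + start, pos - start);
        start = pos + 1;
    }
    fields.emplace_back(data + start, len - start);
    return fields;
}
```

For a raw line consisting of the BOM followed by `id,name,age`, `find_separators` reports offsets 5 and 10. Passing those offsets to `split_at` together with the pointer advanced by 3 bytes yields `id,na`, `e,ag`, and an empty field; subtracting 3 from each offset first yields `id`, `name`, `age`.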

This bug does not affect the non-enclose mode, because PlainCsvTextFieldSplitter
scans the data on-the-fly rather than relying on pre-computed positions.
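
For contrast, a splitter that scans the buffer it is handed has no stored offsets that can go stale. This is a simplified sketch under the same assumptions (comma separator, no enclose handling); `scan_split` is a hypothetical name, not the actual PlainCsvTextFieldSplitter implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical on-the-fly splitter: separators are located while scanning
// the (possibly already shifted) buffer, so skipping a BOM beforehand is harmless.
std::vector<std::string> scan_split(const char* data, std::size_t len, char sep) {
    assert(data != nullptr || len == 0);
    std::vector<std::string> fields;
    std::size_t start = 0;
    for (std::size_t i = 0; i < len; ++i) {
        if (data[i] == sep) {
            fields.emplace_back(data + start, i - start);
            start = i + 1;
        }
    }
    fields.emplace_back(data + start, len - start);
    return fields;
}
```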

Fix

  • Add adjust_column_sep_positions(size_t offset) to EncloseCsvLineReaderCtx to subtract
    the given offset from all pre-computed separator positions.
  • Store the EncloseCsvLineReaderCtx reference in CsvReader when enclose mode is active.
  • Call the adjustment in _remove_bom() when BOM is detected, so all call sites
    (_parse_col_names, _parse_col_nums, get_next_block) are automatically fixed.
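
The shape of the fix can be sketched roughly as follows. Only the name `adjust_column_sep_positions` and the 3-byte BOM come from the description above; `EncloseLineCtxSketch`, `remove_bom_sketch`, and their signatures are assumptions made for illustration and do not mirror the actual Doris classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for EncloseCsvLineReaderCtx.
struct EncloseLineCtxSketch {
    // Separator offsets computed on the raw line, BOM included.
    std::vector<std::size_t> column_sep_positions;

    // Shift every stored offset so it is relative to the pointer
    // after `offset` leading bytes have been skipped.
    void adjust_column_sep_positions(std::size_t offset) {
        for (std::size_t& pos : column_sep_positions) {
            assert(pos >= offset);  // a separator cannot lie inside the skipped prefix
            pos -= offset;
        }
    }
};

// Hypothetical stand-in for CsvReader::_remove_bom(): skips a leading UTF-8 BOM
// and, when an enclose context is present, keeps its offsets in sync.
inline const char* remove_bom_sketch(const char* ptr, std::size_t& size,
                                     EncloseLineCtxSketch* ctx) {
    static const unsigned char kBom[] = {0xEF, 0xBB, 0xBF};
    if (size >= 3 && std::memcmp(ptr, kBom, 3) == 0) {
        ptr += 3;
        size -= 3;
        if (ctx != nullptr) {
            ctx->adjust_column_sep_positions(3);
        }
    }
    return ptr;
}
```

Doing the adjustment inside the BOM-removal step means callers do not need to know whether a BOM was present; passing a null context models the non-enclose path.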

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@sollhui
Contributor Author

sollhui commented Feb 27, 2026

run buildall

@sollhui sollhui force-pushed the fix_parse_utf-8_bom branch from 8bf3090 to d97fb1a on February 27, 2026 03:16
@sollhui
Contributor Author

sollhui commented Feb 27, 2026

run buildall

@doris-robot

TPC-H: Total hot run time: 28774 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit d97fb1a92f37e163f6f56105a8ab53e89881aaea, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17627	4443	4294	4294
q2	q3	10653	771	509	509
q4	4682	354	264	264
q5	7557	1220	1037	1037
q6	182	175	143	143
q7	799	857	660	660
q8	9353	1487	1371	1371
q9	5002	4806	4673	4673
q10	6837	1872	1646	1646
q11	466	249	235	235
q12	744	566	469	469
q13	17820	4223	3433	3433
q14	236	233	216	216
q15	943	797	799	797
q16	763	728	675	675
q17	718	870	416	416
q18	6118	5331	5183	5183
q19	1195	972	611	611
q20	510	486	387	387
q21	4654	1965	1477	1477
q22	394	316	278	278
Total cold run time: 97253 ms
Total hot run time: 28774 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4599	4686	4535	4535
q2	q3	1805	2207	1776	1776
q4	853	1186	758	758
q5	4029	4353	4332	4332
q6	181	178	138	138
q7	1787	1666	1546	1546
q8	2539	2688	2565	2565
q9	7682	7370	7229	7229
q10	2738	2844	2502	2502
q11	536	428	413	413
q12	487	593	454	454
q13	4162	4527	3554	3554
q14	280	298	278	278
q15	860	829	793	793
q16	740	754	751	751
q17	1194	1569	1340	1340
q18	7180	6653	6580	6580
q19	900	941	894	894
q20	2087	2147	2012	2012
q21	3960	3575	3386	3386
q22	456	449	396	396
Total cold run time: 49055 ms
Total hot run time: 46232 ms

@doris-robot

TPC-DS: Total hot run time: 184636 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit d97fb1a92f37e163f6f56105a8ab53e89881aaea, data reload: false

query5	4353	647	516	516
query6	345	225	199	199
query7	4223	471	273	273
query8	334	241	234	234
query9	8738	2754	2730	2730
query10	489	379	351	351
query11	17032	16720	16529	16529
query12	188	127	125	125
query13	1259	447	355	355
query14	6313	3239	2972	2972
query14_1	2826	2849	2819	2819
query15	205	195	176	176
query16	1011	453	359	359
query17	1090	712	585	585
query18	2450	434	337	337
query19	212	202	175	175
query20	135	126	130	126
query21	224	137	119	119
query22	4836	6240	5964	5964
query23	17659	17058	16940	16940
query23_1	17285	17002	17104	17002
query24	7579	1666	1271	1271
query24_1	1228	1240	1250	1240
query25	566	481	435	435
query26	1248	261	155	155
query27	2774	470	286	286
query28	4498	1862	1868	1862
query29	827	587	483	483
query30	319	249	213	213
query31	872	731	624	624
query32	82	74	73	73
query33	527	339	296	296
query34	905	904	577	577
query35	653	681	601	601
query36	1122	1111	962	962
query37	133	93	87	87
query38	2959	2879	2841	2841
query39	891	883	839	839
query39_1	831	835	821	821
query40	236	157	138	138
query41	71	64	64	64
query42	106	104	104	104
query43	370	396	363	363
query44	
query45	207	196	183	183
query46	873	1012	621	621
query47	2132	2168	2095	2095
query48	304	333	272	272
query49	647	474	376	376
query50	692	268	221	221
query51	4077	4116	4138	4116
query52	103	103	98	98
query53	287	333	276	276
query54	296	281	265	265
query55	91	82	78	78
query56	299	305	299	299
query57	1364	1331	1266	1266
query58	287	274	273	273
query59	2569	2675	2588	2588
query60	336	335	318	318
query61	147	146	144	144
query62	634	599	546	546
query63	321	280	270	270
query64	4907	1291	1049	1049
query65	
query66	1457	464	354	354
query67	16374	16296	16240	16240
query68	
query69	423	318	290	290
query70	997	920	885	885
query71	343	294	295	294
query72	2783	2705	2414	2414
query73	550	554	312	312
query74	9991	9932	9731	9731
query75	2840	2744	2475	2475
query76	2302	1044	685	685
query77	372	371	302	302
query78	11094	11377	10751	10751
query79	1126	784	596	596
query80	713	631	549	549
query81	496	283	246	246
query82	1355	146	120	120
query83	374	270	243	243
query84	253	120	100	100
query85	841	511	431	431
query86	423	318	294	294
query87	3110	3075	3023	3023
query88	3520	2669	2636	2636
query89	429	379	341	341
query90	1899	175	171	171
query91	167	154	135	135
query92	82	78	69	69
query93	925	827	518	518
query94	451	311	286	286
query95	581	395	311	311
query96	645	509	225	225
query97	2467	2517	2425	2425
query98	234	221	219	219
query99	1016	1005	916	916
Total cold run time: 252938 ms
Total hot run time: 184636 ms

@doris-robot

BE UT Coverage Report

Increment line coverage 0.00% (0/14) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.50% (19565/37266)
Line Coverage 36.13% (182618/505455)
Region Coverage 32.51% (141965/436656)
Branch Coverage 33.44% (61505/183903)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (14/14) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.35% (26042/36499)
Line Coverage 54.19% (273083/503960)
Region Coverage 51.79% (228300/440834)
Branch Coverage 53.01% (97809/184495)

Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the approved label (Indicates a PR has been approved by one committer.) Feb 28, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

Labels

approved (Indicates a PR has been approved by one committer.), dev/4.0.x, reviewed

4 participants