[fix](csv reader) fix incorrect column parsing when using enclose for CSV files with UTF-8 BOM #60864
Open
sollhui wants to merge 1 commit into apache:master from
TPC-H: Total hot run time: 28774 ms
TPC-DS: Total hot run time: 184636 ms
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
Background
When reading CSV files with a UTF-8 BOM (Byte Order Mark) and the `enclose` character enabled (e.g., `enclose = '"'`), the column names and data values are parsed incorrectly.

Root Cause
In enclose mode, `EncloseCsvLineReaderCtx` pre-computes `column_sep_positions` (absolute byte offsets of column separators) during `read_line()`. These positions are calculated on the raw line data, including the 3-byte BOM (`0xEF 0xBB 0xBF`).

Later, `CsvReader::_remove_bom()` shifts the data pointer forward by 3 bytes, but the pre-computed `column_sep_positions` are not adjusted accordingly. When `EncloseCsvTextFieldSplitter::do_split()` uses these stale positions on the shifted pointer, all field boundaries are off by 3 bytes, resulting in corrupted column names and data.
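
The off-by-3 effect can be reproduced in isolation. The following is a minimal sketch, not the Doris code: `find_sep_positions`, `split_at`, and `split_with_stale_positions` are hypothetical helpers, quote handling is omitted, and the clamping in `split_at` exists only to keep the demo from reading out of range.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the reader context: records the byte offset of
// every separator, measured from the start of the raw line (BOM included).
std::vector<size_t> find_sep_positions(std::string_view raw_line, char sep) {
    std::vector<size_t> positions;
    for (size_t i = 0; i < raw_line.size(); ++i) {
        if (raw_line[i] == sep) positions.push_back(i);
    }
    return positions;
}

// Hypothetical splitter: trusts the pre-computed offsets and slices `line`
// at them. Offsets are clamped to the line length so stale values cannot
// cause an out-of-range access in this demo.
std::vector<std::string> split_at(std::string_view line,
                                  const std::vector<size_t>& positions) {
    std::vector<std::string> fields;
    size_t start = 0;
    for (size_t pos : positions) {
        size_t begin = std::min(start, line.size());
        size_t end = std::min(pos, line.size());
        fields.emplace_back(line.substr(begin, end > begin ? end - begin : 0));
        start = pos + 1;
    }
    fields.emplace_back(line.substr(std::min(start, line.size())));
    return fields;
}

// Reproduces the mismatch: offsets computed on the raw line are applied to
// the view that starts after the 3-byte BOM.
std::vector<std::string> split_with_stale_positions(std::string_view raw_line,
                                                    char sep) {
    constexpr size_t kBomSize = 3;
    std::vector<size_t> positions = find_sep_positions(raw_line, sep);
    std::string_view shifted = raw_line.substr(std::min(kBomSize, raw_line.size()));
    return split_at(shifted, positions);
}
```

For a header line consisting of the BOM followed by `id,name,age`, the separators sit at raw offsets 5 and 10, so slicing the shifted view at those offsets yields `id,na`, `e,ag`, and an empty field instead of `id`, `name`, `age`.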
This bug does not affect the non-enclose mode, because `PlainCsvTextFieldSplitter` scans the data on-the-fly rather than relying on pre-computed positions.
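
For contrast, a splitter that searches for separators in whatever view it is handed has no cached offsets to go stale. This is an illustrative sketch only; `split_on_the_fly` is a hypothetical name and does not mirror the actual `PlainCsvTextFieldSplitter` implementation.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical on-the-fly splitter: separator offsets are discovered relative
// to `line` itself, so it does not matter whether a BOM was stripped earlier.
std::vector<std::string> split_on_the_fly(std::string_view line, char sep) {
    std::vector<std::string> fields;
    size_t start = 0;
    while (true) {
        size_t pos = line.find(sep, start);
        if (pos == std::string_view::npos) {
            fields.emplace_back(line.substr(start));
            break;
        }
        fields.emplace_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```

Passing the BOM-stripped view of `id,name,age` to this function returns the three expected fields, because every offset is computed after the pointer has already been shifted.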
Fix
- Add `adjust_column_sep_positions(size_t offset)` to `EncloseCsvLineReaderCtx` to subtract the given offset from all pre-computed separator positions.
- Store an `EncloseCsvLineReaderCtx` reference in `CsvReader` when enclose mode is active.
- Call `adjust_column_sep_positions()` in `_remove_bom()` when a BOM is detected, so all call sites (`_parse_col_names`, `_parse_col_nums`, `get_next_block`) are automatically fixed.