Faster parsing: reduce String usage, list-based input rows. #15681
Three changes:
1) Reworked FastLineIterator to optionally avoid generating Strings entirely, and reduce copying somewhat. Benefits the line-oriented JSON, CSV, delimited (TSV), and regex formats.
2) In the delimited (TSV) format, when the delimiter is a single byte, split on UTF-8 bytes directly.
3) In CSV and delimited (TSV) formats, use list-based input rows when the column list is provided upfront by the user.
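To illustrate change (2): a single-byte delimiter such as tab is an ASCII byte, and in UTF-8 every byte of a multi-byte character has its high bit set, so the delimiter byte can never appear inside another character. That makes it safe to scan the raw bytes and decode only the individual fields. The following is a minimal sketch of that idea, not the code from this PR; `SingleByteSplitSketch` and `split` are hypothetical names, and the sketch still creates one String per field.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Druid: splits one UTF-8 encoded line
// on a single-byte (ASCII) delimiter without decoding the whole line first.
final class SingleByteSplitSketch
{
  private SingleByteSplitSketch() {}

  static List<String> split(byte[] line, byte delimiter)
  {
    final List<String> fields = new ArrayList<>();
    int start = 0;
    for (int i = 0; i < line.length; i++) {
      // Safe for UTF-8: continuation and lead bytes of multi-byte
      // characters are all >= 0x80, so they never equal an ASCII delimiter.
      if (line[i] == delimiter) {
        fields.add(new String(line, start, i - start, StandardCharsets.UTF_8));
        start = i + 1;
      }
    }
    // The last field runs to the end of the line (empty if the line ends with the delimiter).
    fields.add(new String(line, start, line.length - start, StandardCharsets.UTF_8));
    return fields;
  }
}
```

Decoding per field rather than per line skips the intermediate full-line String, which is the kind of copying the PR description says it aims to reduce.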
* given a particular {@link InputRowSchema}. Note that {@link RowAdapters#standardRow()} always works, but the
* one returned by this method may be more performant.
*/
default RowAdapter<InputRow> createRowAdapter(InputRowSchema inputRowSchema)
Code scanning / CodeQL notice: Useless parameter
The only failure is |
👍
Benchmarks below. Findings:
- JsonLineReaderBenchmark only benefits from change (1), and got a 15% improvement.
- DelimitedInputFormatBenchmark with fromHeader: true benefits from (1) and (2), and got a 22% improvement.
- DelimitedInputFormatBenchmark with fromHeader: false benefits from all three changes, and got a 30% improvement.
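The list-based rows in change (3) can be pictured as follows: when the column list is known before reading starts, a name-to-position index can be built once, and each parsed row can stay a plain list of values instead of becoming a per-row map. This is a rough sketch under that assumption; `ListRowLookupSketch` and its methods are hypothetical and are not Druid's `RowAdapter` API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Druid's RowAdapter: resolves column names
// to list positions once, then reads values from list-based rows by index.
final class ListRowLookupSketch
{
  private final Map<String, Integer> positions = new HashMap<>();

  ListRowLookupSketch(List<String> columns)
  {
    // Built once per input source; the first occurrence of a duplicate name wins.
    for (int i = 0; i < columns.size(); i++) {
      positions.putIfAbsent(columns.get(i), i);
    }
  }

  // Returns null for an unknown column or a row shorter than the column list.
  String get(List<String> row, String column)
  {
    final Integer position = positions.get(column);
    if (position == null || position >= row.size()) {
      return null;
    }
    return row.get(position);
  }
}
```

For example, with columns `ts`, `page`, `added`, calling `get` on the row `["2024-01-01", "Main", "7"]` with column `page` returns `Main`, with no map allocated for that row.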