Fix performance regression in split by avoiding allocating substring per char #237

Merged
stephenamar-db merged 3 commits into databricks:master from JoshRosen:fix-split-perf-regression
Dec 12, 2024
@JoshRosen JoshRosen commented Dec 12, 2024

This PR fixes a performance regression from #227 / 4c85bde which I overlooked in review:

When generalizing the optimized non-Pattern-based split code, that commit introduced a .substring() call at every character position, which allocates a new short-lived string per position and produces a lot of garbage.

Instead, I think we can call .startsWith(splitPattern, i): this should be much faster because it compares against the original string in place and avoids creating garbage strings (plus I'm pretty sure that startsWith is optimized in modern JDKs).

I also removed the use of breakable and replaced it with an update to the while condition.
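
For illustration, here is a minimal sketch of the approach described above. It is not the code from this PR: `SplitSketch`, `splitLiteral`, and the `maxSplits` parameter are hypothetical names, and the behavior for a negative `maxSplits` (no limit) is an assumption. It shows `String.startsWith(prefix, offset)` used at each position instead of a per-position `substring`, with the limit check folded into the `while` condition rather than wrapped in `breakable`.

```scala
import scala.collection.mutable.ArrayBuffer

object SplitSketch {
  // Hypothetical helper, not the project's actual implementation.
  // Splits `str` on the literal separator `sep`, performing at most
  // `maxSplits` splits (a negative value is assumed to mean "no limit").
  def splitLiteral(str: String, sep: String, maxSplits: Int = -1): Vector[String] = {
    require(sep.nonEmpty, "separator must be non-empty")
    val out = ArrayBuffer.empty[String]
    var start = 0  // index where the current piece begins
    var i = 0      // current scan position
    var splits = 0 // number of splits performed so far
    // The split limit lives in the loop condition, so no `breakable` is needed.
    while (i <= str.length - sep.length && (maxSplits < 0 || splits < maxSplits)) {
      // startsWith(prefix, offset) compares in place; nothing is allocated here.
      if (str.startsWith(sep, i)) {
        out += str.substring(start, i) // one allocation per emitted piece only
        i += sep.length
        start = i
        splits += 1
      } else {
        i += 1
      }
    }
    out += str.substring(start) // remainder after the last separator
    out.toVector
  }
}
```

With this shape, the only `substring` calls are the ones that produce the returned pieces, so the number of allocations scales with the number of pieces rather than with the input length.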

@JoshRosen JoshRosen changed the title from "Fix performance regression in splitLimit by avoiding allocating substring per char" to "Fix performance regression in split by avoiding allocating substring per char" Dec 12, 2024
@stephenamar-db stephenamar-db merged commit 680b1a8 into databricks:master Dec 12, 2024
@JoshRosen JoshRosen deleted the fix-split-perf-regression branch December 31, 2024 22:40