Skip to content

chore: Add substr() benchmarks, refactor#20803

Open
neilconway wants to merge 3 commits intoapache:mainfrom
neilconway:neilc/substr-refactor
Open

chore: Add substr() benchmarks, refactor#20803
neilconway wants to merge 3 commits intoapache:mainfrom
neilconway:neilc/substr-refactor

Conversation

@neilconway
Copy link
Contributor

@neilconway neilconway commented Mar 8, 2026

Which issue does this PR close?

N/A

Rationale for this change

I'd like to optimize substr for scalar start/count inputs, but the code would benefit from some refactoring and cleanup first. I also added benchmarks for substr with scalar args.

What changes are included in this PR?

  • Refactor string_view_substr and string_substr to use a single loop
  • Change get_true_start_end to validate its own inputs, cleanup UTF8 path
  • Add benchmark cases for scalar start and count arguments
  • Improve docs

Are these changes tested?

Yes.

Are there any user-facing changes?

No, other than an error message wording change.

AI usage

Multiple AI tools were used to iterate on this PR. I have reviewed and understand the resulting code.

@github-actions github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation spark labels Mar 8, 2026
@neilconway neilconway changed the title chore: Add substr benchmarks, refactor chore: Add substr() benchmarks, refactor Mar 8, 2026
argument(
name = "start_pos",
description = "Character position to start the substring at. The first character in the string has a position of 1."
description = "Character position to start the substring at. The first character in the string has a position of 1. If the start position is less than 1, it is treated as if it is before the start of the string and the (absolute) number of characters before position 1 is subtracted from `length` (if given). For example, `substr('abc', -3, 6)` returns `'ab'`."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a new behaviour? As far as I know this doesn't match postgresql's behaviour (as described in the doc on get_true_start_end) where negative start is not allowed. I think for PG the recommended approach there is to use the right(..) function, for duckdb it's string slices.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Omega359 Thanks for the comment! This is not new behavior. The behavior for negative start values actually matches what PostgreSQL implements (and what the SQL spec dictates), even though personally that behavior doesn't seem particularly useful to me. get_true_start_end does allow negative start values; maybe you're thinking of negative count values, which are not allowed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation functions Changes to functions implementation spark sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants