Skip to content

Add option to speed up string match/replace with --max-matches#10587

Merged
mqudsi merged 3 commits into
fish-shell:masterfrom
mqudsi:string-max-matches
Jun 30, 2024
Merged

Add option to speed up string match/replace with --max-matches#10587
mqudsi merged 3 commits into
fish-shell:masterfrom
mqudsi:string-max-matches

Conversation

@mqudsi
Copy link
Copy Markdown
Contributor

@mqudsi mqudsi commented Jun 24, 2024

I've often needed a way to get the last bit of performance out of unwieldy completions that involve a lot of string processing (apt completions come to mind, and I ran into it just now with parsing man pages for kldload completions).

Since many times we are looking for just one exact string in the haystack, an easy optimization here is to introduce a way for string match or string replace to early exit after a specific number of matches (typically one) have been found.

Depending on the size of the input, this can be a huge boon. For example, parsing the description from FreeBSD kernel module man pages with

zcat /usr/share/man/man4/zfs.4.gz | string match -m1 '.Nd *'

runs 35% faster with -m1 than without (bearing in mind that most of the time is taken just starting up zcat, parsing the compressed contents, and decompressing the file!), while processing all files under /usr/share/man/man4/*.4.gz in a loop (so a mix of files ranging from very short to moderately long) runs about 10% faster overall with -m1 (for context, in absolute terms that is 0.5s+ faster).

TODOs:

  • Tests and docs to follow if no one has any objections to this feature.

I've often needed a way to get the last bit of performance out of unwieldy
completions that involve a lot of string processing (apt completions come to
mind, and I ran into it just now with parsing man pages for kldload
completions).

Since many times we are looking for just one exact string in the haystack, an
easy optimization here is to introduce a way for `string match` or `string
replace` to early exit after a specific number of matches (typically one) have
been found.

Depending on the size of the input, this can be a huge boon. For example,
parsing the description from FreeBSD kernel module man pages with

    zcat /usr/share/man/man4/zfs.4.gz | string match -m1 '.Nd *'

runs 35% faster with -m1 than without, while processing all files under
/usr/share/man/man4/*.4.gz in a loop (so a mix of files ranging from very short
to moderately long) runs about 10% faster overall with -m1.
@mqudsi mqudsi changed the title Add option to speed up string match/replace with --max-matches Add option to speed up string match/replace with --max-matches Jun 24, 2024
@faho
Copy link
Copy Markdown
Member

faho commented Jun 25, 2024

Prior art in #7495

@mqudsi
Copy link
Copy Markdown
Contributor Author

mqudsi commented Jun 25, 2024

Thanks for digging that out, @faho. I had a vague feeling we'd discussed this before!

I guess the answer to that is that we shouldn't have questioned your judgement on this matter! :D

But more seriously, in that thread

  • I asked for examples of specific cases that would be sped up significantly by this -- answered in the commit message with benchmarks
  • @krobelus suggested the use of subshell output line splicing to limit the input to string in the first place. I actually tried this and found it to be of little benefit in the specific kldload/zcat case (with something like string replace -rf -- PATTERN REPLACEMENT (zcat $path)[1..50]). In this case, the slowest part is actually zcat and not PCRE2 – I didn't want to unnecessarily wait for zcat to complete its job if a match had already been found, but appending | head -n1 afterwards ended up being a net negative (due to the cost of spawning an additional process for the smaller length inputs). Additionally, using index slicing requires making a guess as to how many lines in a match will be found, so it's always going to be a tradeoff between being too conservative and increasing match times or being more rigid and risk missing matches that happened past that.

I would like to point out something that isn't a regression/breaking change with this new feature, but is something worth being aware of in all cases: obviously string match/replace terminating sooner means that the input is not being fully consumed. This actually unlocks more use cases (you can read from stdin to string until n matches are found, then read the remainder with another utility without having to open the fd and read to the same point twice) but at the same time one just needs to be aware that the input fd to string won't be fully consumed (as it would have been if piping naively to string or if using subshell index slicing).

@faho
Copy link
Copy Markdown
Member

faho commented Jun 25, 2024

This actually unlocks more use cases (you can read from stdin to string until n matches are found, then read the remainder with another utility without having to open the fd and read to the same point twice)

Unfortunately no (#7495 (comment)):

However, because the string might read too far (remember how read has to read byte-by-byte?) it's not guaranteed to be anything sensible.

string reads in 1024 byte chunks, and it cannot stuff bytes back into the pipe.

Reading byte-for-byte is much slower, so it's not a good idea.

@mqudsi
Copy link
Copy Markdown
Contributor Author

mqudsi commented Jun 25, 2024

Ah, fair point.

@mqudsi
Copy link
Copy Markdown
Contributor Author

mqudsi commented Jun 27, 2024

Added tests and documentation.

@mqudsi mqudsi added this to the fish next-3.x milestone Jun 27, 2024
@mqudsi mqudsi added enhancement performance Purely performance-related enhancement without any changes in black box output labels Jun 27, 2024
@mqudsi mqudsi merged commit 8b75979 into fish-shell:master Jun 30, 2024
@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Jul 7, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

enhancement performance Purely performance-related enhancement without any changes in black box output

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants