Score directories based on the keywords searched #264

lefth · 2021-09-17T14:51:53Z

Use word boundary detection and give scores based on keyword matching

I'm using a helper library to implement a unicode algorithm, but I'm also
detecting case changes within a word (from lower to upper case, or no case
to some case, so the "o" in "Documents" doesn't count as a new word).

Words are not searched--rather, the string is searched starting at a word
boundary. That way multi-word sequences will correctly match.

This is a basic solution to #260. There's lots of room for improving the algorithm,
but IMO this is a big UX improvement already.

Some things to consider:

We don't have options to control the case. If smart-case is disabled,
the weights in compute_kw_score need to change.
Right now keyword score totally overpowers the frequency score. The
frequency score is only a tie-breaker. They could be normalized and
weighted so a much better frequency score would win despite a slightly
worse keyword score.
Should we detect word endings for exact matches? I'm not sure it would
give a good user experience. If I frequently access "src9" but not "src",
I don't want "src" to win just because it more exactly matches what I
typed. It's hard to refrain from typing a whole word. This gives an
interesting wrong result with "c" being an ideal match for
"/mnt/c/anything".
The above issue can be mitigated but not solved if we consider digits to start a new word.
I'm testing these changes with a command like cargo r -- query --list --score b.

lefth · 2021-09-25T05:32:39Z

@ajeetdsouza Do you know if some people have really massive directory databases? I optimized this more than might strictly be necessary if it's unlikely anyone has more than a couple hundred directories.

Also related to that: if you don't like the extra complexity of adding a new iterator class and delaying sorting, I can do away with that by sorting in the constructor as before, either by sorting again when the parameters and excludes change, or by removing methods like with_keywords and passing that data to the constructor instead.

ajeetdsouza · 2021-09-26T23:04:55Z

I optimized this more than might strictly be necessary

Optimization would be great, but we should focus on finalizing the algorithm first. I haven't had time to look at this PR yet, but hopefully I'll be able to check it out soon.

On a related note, I tried searching for how similar tools handle a z foo command with directories /foo and /foobar, since that seems to be one of the primary motives of this change.

autojump
The solution given was to manually lower the score of the /foobar directory, or re-run the z command since the current directory is excluded (wting/autojump#533).

z
The algorithm was changed to always select the shortest match if the prefix is common among all results. The author comments that the solution may have been suboptimal (rupa/z#35).
Someone else suggested going to an exact match if there is a query like z foo/, which sounds doable (rupa/z#303).

Since we're already analyzing the querying algorithm, here's another suggestion from #263: for a user named foo, z foo bar should probably prefer /home/foo/dir/foo/bar rather than /home/foo/bar. It's unlikely that the user would actually mention their username when searching for a path.

lefth · 2021-11-26T05:44:05Z

@ajeetdsouza What would it take for you to accept this pull request? Simpler code, smaller changes?

I suggest this change is a key step toward improving the algorithm--improvement can take a faster pace after the search string influences the score, but after that step a lot of tweaking is possible.

This prevents keywords or options from being added after sorting is done. So ajeetdsouza#260 can be implemented more safely.

Keyword-based scoring is currently a noop. Directory filtering is done before scoring, except for a mutating filter that's complex to execute earlier. This is a step toward implemnting ajeetdsouza#260.

I'm using a helper library to implement a unicode algorithm, but I'm also detecting case changes within a word (from lower to upper case, or no case to some case, so the "o" in "Documents" doesn't count as a new word). Words are not searched--rather, the string is searched starting at a word boundary. That way multi-word sequences will correctly match. This is a basic solution to ajeetdsouza#260. Some things to consider: - We don't have options to control the case. If smart-case is disabled, the weights in compute_kw_score need to change. - Right now keyword score totally overpowers the frequency score. The frequency score is only a tie-breaker. They could be normalized and weighted so a much better frequency score would win despite a slightly worse keyword score. - Should we detect word endings for exact matches? I'm not sure it would give a good user experience. If I frequently access "src9" but not "src", I don't want "src" to win just because it more exactly matches what I typed. It's hard to refrain from typing a whole word. This gives an interesting wrong result with "c" being a perfect match for "/mnt/c/anything". - The above issue can be solved if we consider digits to start a new word. - I'm testing these changes with `cargo r -- query --list --score b`.

…er than other matches.

So when searching 'd', ~/documents matches better than ~/my-documents.

A future option should be to turn off smart case, but that's not a priority, for the reasons mentioned here: ajeetdsouza#224

lefth · 2021-11-27T03:45:14Z

BTW, this will obviously still need tweaking after it's released. There's no way to make it perfect at the first go without a lot of beta testers. To tweak, I'm thinking we could add to add this to the help string:

If a query takes you to a directory it shouldn't (if the smart logic for picking a directory isn't smart enough),
please comment on with this query and the first few lines of output, so we can see directory scores
and judge if the matcher needs tweaking: zoxide --version && zoxide query -ls <the query>

ajeetdsouza · 2021-11-28T19:43:15Z

Hey @lefth, sorry for the delay, I've been having a very busy month! I took a look at the changes, I put in my thoughts below.

I don't think there's any need to change the DirDisplayScore implementation. The purpose of including score in the output is so that users can implement custom functions in their shell configurations -- essentially, zoxide query -ls | sort should sort results by score. Thus, we should ideally change the score function to return the frecency combined with the match score.

This gives an interesting wrong result with c being an ideal match for /mnt/c/anything.

I think that requiring the last component of the path to match the last component of the query works really well, and should not be removed. If you really want to match the highest ranked subdirectory under foo, you can always use a query like z foo /.

Part of this discussion was inspired by the fact that there is no way in zoxide (or in fact, any other autojumper) to cd into /foo if there was a directory called /foobar that was more highly ranked. Are we tackling this problem in any way?

The one way I can think of is to improve the interactive search feature to the point that it becomes a seamless part of one's workflow. This is why I tried to prioritize #257 -- I wanted a user to simply press <SPACE><TAB> and get to interactive selection. This is now available on bash and fish, and I'm bringing it to zsh once I get the time.

I've mentioned this elsewhere, but one of the things I really like about zoxide is the predictability of the algorithm -- even when it's wrong, it's usually not hard to understand why and adapt the query accordingly. The more complex the algorithm becomes, the harder it becomes for people to understand why zoxide picked the directory it did, and the more frustrating it gets when zoxide jumps to the wrong directory.

Ideally, I'd think a good query algorithm would work the same way the current algorithm does, but would assign some weight (in descending order) to-

exact matches of components (split by /; example: component is foo and query is foo)
left-side matches of components (example: component is foobar and query is foo)
exact matches of words (split by unicode word separators)
left-hand matches of words
regular matches (the way they are now)

Still more ambitious is smartcase and unicode normalization. After this, we could sum up the weights and find the one with the highest score. I'm not sure how this would be implemented, but what do you think of the general idea?

ajeetdsouza · 2022-09-06T07:52:41Z

Closing due to inactivity, feel free to open an issue to discuss this further.

ajeetdsouza force-pushed the main branch 3 times, most recently from 8325b84 to 9176634 Compare September 28, 2021 22:36

ajeetdsouza mentioned this pull request Oct 17, 2021

Smart case search #224

Open

ajeetdsouza force-pushed the main branch 12 times, most recently from 9899728 to ba7b445 Compare November 1, 2021 16:41

lefth added 3 commits November 26, 2021 13:45

Make new iterator to delay sorting

d55f827

This prevents keywords or options from being added after sorting is done. So ajeetdsouza#260 can be implemented more safely.

Preliminary: score directories based on search keywords

6eba7d0

Keyword-based scoring is currently a noop. Directory filtering is done before scoring, except for a mutating filter that's complex to execute earlier. This is a step toward implemnting ajeetdsouza#260.

lefth force-pushed the main branch from d6f5db0 to 21886e5 Compare November 26, 2021 05:45

lefth added 4 commits November 26, 2021 13:51

Update documentation and completion strings: describe the score format.

2313661

Matches in the last path component should be considered a little bett…

f94db84

…er than other matches.

Consider leftmost matching within a component to be best

8d59bd5

So when searching 'd', ~/documents matches better than ~/my-documents.

Implement smart case matching

a260d92

A future option should be to turn off smart case, but that's not a priority, for the reasons mentioned here: ajeetdsouza#224

ajeetdsouza force-pushed the main branch from 592b2b6 to 3e03ccc Compare December 1, 2021 19:25

ajeetdsouza force-pushed the main branch 2 times, most recently from d365bfd to 58430d8 Compare December 5, 2021 09:21

ajeetdsouza force-pushed the main branch 7 times, most recently from b2f049d to fd088b4 Compare December 24, 2021 10:50

ajeetdsouza force-pushed the main branch 2 times, most recently from de16c49 to 1736571 Compare March 9, 2022 14:29

ajeetdsouza force-pushed the main branch 3 times, most recently from dc7b300 to 3620189 Compare April 25, 2022 14:26

ajeetdsouza force-pushed the main branch from 7d9aae5 to 4bf470d Compare May 19, 2022 11:42

ajeetdsouza force-pushed the main branch 3 times, most recently from 79229c4 to 3df60eb Compare June 25, 2022 19:52

ajeetdsouza closed this Sep 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Score directories based on the keywords searched #264

Score directories based on the keywords searched #264

lefth commented Sep 17, 2021 •

edited

lefth commented Sep 25, 2021

ajeetdsouza commented Sep 26, 2021 •

edited

lefth commented Nov 26, 2021

lefth commented Nov 27, 2021

ajeetdsouza commented Nov 28, 2021 •

edited

ajeetdsouza commented Sep 6, 2022

Score directories based on the keywords searched #264

Score directories based on the keywords searched #264

Conversation

lefth commented Sep 17, 2021 • edited

lefth commented Sep 25, 2021

ajeetdsouza commented Sep 26, 2021 • edited

lefth commented Nov 26, 2021

lefth commented Nov 27, 2021

ajeetdsouza commented Nov 28, 2021 • edited

ajeetdsouza commented Sep 6, 2022

lefth commented Sep 17, 2021 •

edited

ajeetdsouza commented Sep 26, 2021 •

edited

ajeetdsouza commented Nov 28, 2021 •

edited