Skip to content
This repository has been archived by the owner on Sep 6, 2021. It is now read-only.

Quick Open heuristic should prefer contiguous substrings #2068

Closed
njx opened this issue Nov 6, 2012 · 17 comments
Closed

Quick Open heuristic should prefer contiguous substrings #2068

njx opened this issue Nov 6, 2012 · 17 comments
Assignees
Milestone

Comments

@njx
Copy link
Member

njx commented Nov 6, 2012

  1. Open the brackets source folder (as of b455ec3)
  2. Cmd-Shift O
  3. Type spec/live

Result: The highlighting in the first few entries suggests they're being sorted to the top because of the "v" and "e" being scattered throughout the path, which seems weird. It seems like the heuristic should prefer the contiguous "live" at the beginning of the path entry.

@njx
Copy link
Member Author

njx commented Nov 6, 2012

Another example: do a quick open for samples/index.html. The first hit is for src/thirdparty/CodeMirror2/mode/ntriples/index.html, but it seems like ones that actually start with "samples" should be a higher priority.

@ghost ghost assigned peterflynn Nov 6, 2012
@njx
Copy link
Member Author

njx commented Nov 6, 2012

Assigning to @peterflynn

@njx
Copy link
Member Author

njx commented Nov 7, 2012

Actually, I think the reason this happens is because we walk right-to-left, on the theory that the rightmost segments are more specific. That makes sense at the segment level, but I don't think it makes sense at the character level. What if we walked right-to-left through segments, but within each segment we walk left-to-right?

@njx
Copy link
Member Author

njx commented Nov 7, 2012

Reviewed.

@njx
Copy link
Member Author

njx commented Nov 8, 2012

Bumping to medium priority--this is really breaking some common cases. For example, if you do Quick Open on "Commands", you get "CommandManager.js" as the first hit instead of "Commands.js".

@njx
Copy link
Member Author

njx commented Nov 10, 2012

Nominating for sprint 17 since the new heuristic has slightly worse behavior in some cases.

@peterflynn
Copy link
Member

Yeah... the heuristic is definitely improved on the whole (the robustness to typos alone is worth it!), but there are definitely a lot of little cases where the top hit isn't what I'd expect. Here are some more examples:

  • Searching for "extensions" lists ExtensionLoader.js above items with the contiguous string "extensions" in their path (seems like the last-segment boost is outweighing the longer chain of contiguous chars)
  • Searching "@​Doc" in DocumentManager doesn't list Document first. Matches starting at the start of a segment should probably get a bigger boost.
  • Searching "@​get" in WorkingSetSort lists _handleSortWorkingSetByType before "get" (exact match) and "getCommandID" & "getCompareFn" (exact contiguous prefixes). Seems like the bonus for uppercase matches is outweighing the bonus for contiguous matches. (Same if you search "@​init" in Menus.js -- init() is listed last, after three highly discontiguous matches that include an uppercase "I". Or if searching for "DMan": "DOMAgent" ranks higher than "DocumentManager").
  • "EUtil" lists EditorUtils first, but ExtensionUtils 6th, behind FileUtils/CodeHintUtils/LESSUtils. Seems like discontiguous matches should be penalized much less when the gap includes a separator (capitalization change, non-word char, etc.).
  • We should consider favoring shorter strings, all else being equal. E.g. "menus" favors "Menus-test.js" over "Menus.js", and "root/strings" will favor the strings.js buried deep under src/extensions/samples rather than the one closer to the top level in src/nls.

I like the idea of walking left-to-right within each segment, but I don't think it would fix any of these ranking problems (nor the ones NJ mentioned above). It does seem worth fixing, since it leads to odd highlighting (e.g. "menus" highlights "Menus.js" when you'd expect "Menus.js"). But it seems lower priority.

Also note that regardless of direction, because the matching is greedy there will always be cases where we fail to find the longest matching substring. E.g. left-to-right searching for "abcd" in "abcxxxabcd" would yield "abcxxxabcd" which isn't optimal either.

@njx
Copy link
Member Author

njx commented Nov 26, 2012

Moving out to sprint 18. Somewhat risky, not an official feature.

@ghost ghost assigned dangoor Dec 12, 2012
@dangoor
Copy link
Contributor

dangoor commented Dec 20, 2012

I've probably spent more time than was desired/anticipated on this, and I'm closer to a "solution" but not there yet. I put "solution" in quotes, because I don't think it's really possible to have the first matched item always be the one you'd anticipate, but it can be the case most of the time.

We can probably make things better by tweaking the scoring, but I think relying purely on right-to-left walking is going to leave us with many cases that feel funny.

My first thought was to find the longest contiguous substring and build a result around that. Unfortunately, finding the longest contiguous substring is slow (not just in running time complexity but in actual observed time on my machine). Beyond that, while I think having the longest contiguous substring would make things better, I think we'd still end up with suboptimal cases.

I observed what other editors do and found a rough comparison order that I like. I say "rough" because I haven't implemented it yet, and I know there are little gremlins. I believe that this would produce fundamentally better results than what we have now.

  1. split the string up into segments (split on "/")
  2. compare each character of the query string preferring matches in this order:
    1. start with last (most-specific) segment
    2. compare first character
    3. compare other special characters (first capital letter after a lower case, to help with camelCase, ".", or "-")
    4. failing that, leave the first segment and compare the first+special characters of the other segments
    5. failing that, start from the left and compare the non-special characters
  3. for the next character of the query string, you only search for characters that are sequentially after that first one
  4. I'm leaning toward checking first for contiguous strings and then following the order outlined in step 2
  5. if it tries to cram everything into the last segment and fails, it would need to start over not preferring the last segment

I don't know if I'll have this done in time for the end of the sprint, especially given review.

@pthiess
Copy link
Contributor

pthiess commented Dec 20, 2012

I think it would be ok to move this out to sprint 19. Good feedback, Kevin!

@njx
Copy link
Member Author

njx commented Dec 20, 2012

Good thinking here. I think the general approach of right-to-left for the whole list of segments and left-to-right-with-contiguity for individual segments makes sense. (I'm assuming that by "split the string up into segments", you're talking about the string to be matched, not the query.) A couple of other thoughts:

  • If the user types a query that itself has multiple segments (e.g. "src/index.html"), how should we deal with that? I'm guessing we should basically apply the steps you outline above for the last segment of their query. For the immediately previous segment of their query, apply the same steps except starting at the second-to-last segment of the match string. Then maybe add some weighting factor that prefers "contiguous segments" (i.e., if they type "src/index.html", prefer "foo/src/index.html" over "src/foo/index.html").
  • I wonder if it would be worth vetting the heuristics against a decent-sized set of representative use cases that we think will be most common. As you pointed out, we'll never get every case right, but we might want to try to ensure that a certain set of use cases are predictably correct. (I don't even know that we need to do this manually right now--we could just essentially build those cases as a set of unit tests, try the heuristics against them, and then keep tweaking the heuristics until they get them right.)

@dangoor
Copy link
Contributor

dangoor commented Dec 20, 2012

@pthiess I'm making good progress now (TDD FTW in a case like this), but yeah, this looks like a "land early in sprint 19" thing given the timing. I am hoping to have a pull request up today.

@njx:

I don't know for sure yet, but I don't think we need to account for multiple segments in the query. "/" is considered a "special character" and automatically gets special treatment when searching and scoring in my new algorithm.

I totally agree with having a decent set of representative cases and I already have a test function that takes a query and a list of strings to test and verifies that the strings are scored to appear in the order that the test provides. I will feed that a bunch of cases (you and @peterflynn have provided a bunch of great ones here) to ensure sane results.

@pthiess
Copy link
Contributor

pthiess commented Dec 20, 2012

@dangoor I moved it to Sprint 19 and deleted it from the Trello card for Sprint 18

@dangoor
Copy link
Contributor

dangoor commented Dec 21, 2012

Another update: yesterday, I got the matching working and now I have the scoring working well. The results seem much better than the old algorithm.

There's one bug left to fix and I need to write a lot of comments.

@njx
Copy link
Member Author

njx commented Dec 21, 2012

Sweet! Looking forward to it.

peterflynn added a commit that referenced this issue Jan 18, 2013
Fix for #2068: better QuickOpen heuristics
@ghost ghost assigned njx Jan 18, 2013
@peterflynn
Copy link
Member

FBNC @njx

@njx
Copy link
Member Author

njx commented Jan 18, 2013

Looks great. The only case that I found that seems a little weird to me is "cm.js" matches "CodeHintManager.js" before "CommandManager.js"--I would expect the latter to come first because the two uppercase characters are contiguous in the query. But we're never going to make Quick Open read my mind in every case :)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants