Skip to content
This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Rewrite scoring algorithm to support run of consecutive character, fix acronyms and add optimal selection of character. #22

Closed
wants to merge 77 commits into from

Conversation

jeancroy
Copy link

@jeancroy jeancroy commented Jul 9, 2015

This will be available for testing as an option in fuzzy-finder

Screen

Please address problems, comments, and suggestions here:
https://github.com/jeancroy/fuzzaldrin-plus

We start a tuning phase, so anything not quite right is welcome.

@plrenaudin
Copy link

Don't forget that Travis is running on Linux and AppVeyor on Windows -> Different path separators.
(hopefully it's the problem here)

it "returns the result in order", ->
candidates = [
'Find And Replace: Selet All',
'Settings View: Uninstall Packages',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Settings View: View Installed Themes is missing from this list.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because current ranking was not optimal so I did not wanted to set current ranking as spec.
On the other side if I place "optimal" ranking as per @plrenaudin the build will fail...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand there's no "nice to have" spec, it's pass all or the build fail

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pity.

But even without that this version of the patch is still an improvement over the current situation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done :) Added support for your usecase in both spec and scoring (implemented as an extra bonus if exact match happens rigth after token separator)

@walles
Copy link

walles commented Jul 9, 2015

Maybe include the test case from #18 in the spec?

@walles
Copy link

walles commented Jul 9, 2015

And the test case from atom/atom#7783 perhaps?

@plrenaudin
Copy link

Brilliant! I hope this will get through!

@jeancroy
Copy link
Author

jeancroy commented Jul 9, 2015

One thing I miss by a long shot is this spec:
https://github.com/atom/fuzzaldrin/blob/master/spec/score-spec.coffee#L6

It only pass because that bypass
https://github.com/atom/fuzzaldrin/blob/master/src/fuzzaldrin.coffee#L18

Which with the new scoring scheme would be considered a bug.
I'd ask permission to change the perfect==2 spec if possible.

Reason is that the various bonus, make it very hard to compute a meaninfull upperbound, especially things that act in the middle of the string like CamelCase, position in string bonus etc.

Also I believe in this project usage a accurate relative score ordering matter more than saying I have a 85% match.

@jeancroy
Copy link
Author

jeancroy commented Jul 9, 2015

I'd also ask permission to remove this line
https://github.com/atom/fuzzaldrin/blob/master/src/fuzzaldrin.coffee#L12

So if one search for "git push", i get token information and be able to do exact match for "git" then for "push". Or even simply for "git push" (because "gitpush" will never exact match "git push")

@mnquintana
Copy link
Contributor

/cc @kevinsawicki

@walles
Copy link

walles commented Jul 9, 2015

@jeancroy, regarding score-spec.coffee, I think the only thing that makes sense is to compare relative scores, and possibly 0 scores (if that's the minimum).

What we want tests for is sort order, and absolute comparisons like the one at https://github.com/atom/fuzzaldrin/blob/master/spec/score-spec.coffee#L6 doesn't really do that, and IMO you should just remove that test together with the bypass at https://github.com/atom/fuzzaldrin/blob/master/src/fuzzaldrin.coffee#L18.

And I'd apply similar reasoning for https://github.com/atom/fuzzaldrin/blob/master/src/fuzzaldrin.coffee#L12; if you can remove it and honestly say that any specs that start failing aren't relevant any more, go for it!

Those lines are there to solve a problem, and if that problem isn't there any more neither should those lines.

@plrenaudin
Copy link

+1 on what @walles said

…tched by either a another space or a slash. Non optional characters are called core.

Strict isMatch is applied only on core characters of query. Soft score have access to the full query for scoring. This replace the regex that strip space from query and increase precision for multi word query eg "git push".

To be able to do that without recomputing core for each item, it is done at the filter() level.
#21

Added suffix bonus to mirror prefix bonus, exact token match will get both bonus.

Now define optional characters as a whitelist instead of a blacklist.
This will better support text with no latin character.
…ild fails)

Added spec to test for that preference

Added spec to confirm we handle issue18
#18

Cleaned up comments
@jeancroy
Copy link
Author

@walles I really appreciate what you do. Despite contributing a considerable amount of code, I'm still learning etiquette of open source.

@walles
Copy link

walles commented Jul 10, 2015

@jeancroy :)

OTOH, all these things are things I'd have said at work as well.

And I do appreciate you both coding and listening. Thanks @jeancroy!

The thing is also that I thought about diving into this myself, but since you're already doing such a great job with it I think my time is better spent reviewing.

@plrenaudin
Copy link

Great job guys, I'm currently testing the fuzzy-finder using this PR. As expected, it's quite slow but OK IMHO.
Just noticed that the highlight of the character is off when there is space in the query.
Edit: also when there is almost no match, I have some files with highlights and others without.
Edit2: also, the CamelCase didn't work completely. I type MCC (upper case) for MyCamelCase and the results somethingmccsomething (lower case) scores higher. MyCamelCase result is just after though.

@walles
Copy link

walles commented Jul 10, 2015

This patch (avoids many calls to toLowerCase()) improves the benchmark results for this PR by about 2x:

diff --git a/src/scorer.coffee b/src/scorer.coffee
index bbbc1f2..bf54696 100644
--- a/src/scorer.coffee
+++ b/src/scorer.coffee
@@ -49,6 +49,9 @@ opt_char_re = /[\ \-\_\xE3\xE0\xE1\xE4\xE2\xE6\u1EBD\xE8\xE9\xEB\xEA\xEC\xED\xEF
 #

 exports.score = score = (subject, query, ignore) ->
+  subject_l = subject.toLowerCase()
+  query_l = query.toLowerCase()
+
   m = query.length + 1
   n = subject.length + 1

@@ -78,7 +81,7 @@ exports.score = score = (subject, query, ignore) ->
       # Score the options
       gapA = gapArow[j] = Math.max(gapArow[j] + we, vrow[j] + wo)
       gapB = Math.max(gapB + we, vrow[j - 1] + wo)
-      align = vd + scoreChar(query, subject, i - 1, j - 1)
+      align = vd + scoreChar(query, query_l, subject, subject_l, i - 1, j - 1)
       vd = vrow[j]

       #Get the best option
@@ -95,7 +98,7 @@ exports.score = score = (subject, query, ignore) ->
   lpos = m - n - 1

   #sustring bonus, start of string bonus
-  if ( p = subject.toLowerCase().indexOf(query.toLowerCase())) > -1
+  if ( p = subject_l.indexOf(query_l)) > -1
     vmax += wex * m * (1.0 + 5.0 / (5.0 + p))

     #sustring happens right after a separator (prefix)
@@ -108,11 +111,13 @@ exports.score = score = (subject, query, ignore) ->

   return vmax

-scoreChar = (query, subject, i, j) ->
+scoreChar = (query, query_l, subject, subject_l, i, j) ->
   qi = query[i]
+  qi_l = query_l[i]
   sj = subject[j]
+  sj_l = subject_l[j]

-  if qi.toLowerCase() == sj.toLowerCase()
+  if qi_l == sj_l

     #Proper casing bonus
     bonus = if qi == sj then wc else 0
@@ -129,13 +134,14 @@ scoreChar = (query, subject, i, j) ->

     #get previous char
     prev_s = subject[j - 1]
+    prev_s_l = subject_l[j - 1]
     prev_q = query[i - 1]

     #match FOLLOW a separator
     return wa + bonus if ( prev_s of sep_map) or ( prev_q of sep_map )

     #match IS Capital in camelCase (preceded by lowercase)
-    return wa + bonus if (sj == sj.toUpperCase() and prev_s == prev_s.toLowerCase())
+    return wa + bonus if (prev_s == prev_s_l and sj == sj.toUpperCase())

     #normal Match, add proper case bonus
     return wm + bonus

Optional are now only space and space like
@jeancroy
Copy link
Author

Hi @walles I like your idea, but I moved the lowercase test out of scoreChar because it was getting a bit ugly with all those arguments, and the whole function was basically an large if lowercase match. Now we gate the call to this function with the lowercase test.

@walles
Copy link

walles commented Jul 10, 2015

I agree your version is more readable. Nice!

@kshnurov
Copy link

"We don't have time for this" and "it's not our priority" after 2 months of waiting and so much people asking. That's why I hate pseudo-opensource - it doesn't matter that you can see the code if you can't fix it.

@jesseleite
Copy link

@kshnurov Hey now... Even open source needs direction. Atom shouldn't blindly merge changes in, especially overhauls like this. If you care about the future of the editor, a core team reviewing PRs is smart. The fact that @jeancroy is doing all this work is an amazing community contribution, and the fact that Atom's core team is honest about their stance is also a good thing.

All that said, I hope fuzzy finder doesn't get ignored for too long! It's one of Atom's biggest flaws IMO.

@jeancroy
Copy link
Author

Honestly 2 weeks of test as an optional feature is not that bad of a compromise. Personally I was going for at least bumping to a major version.

@kshnurov
Copy link

@jesseleite users feedback should be taken into account when the direction is chosen, not ignored. Core team reviewing PRs is smart, core team saying "we won't even review this" on the ready PR for one of the biggest flaw (and you're 100% right on that) is not smart at all.

I've seen that a dozens of times - it starts with "it's not our priority" on a feature/PR everyone asks for and ends up with a dead software no one uses.

@jesseleite
Copy link

@jesseleite users feedback should be taken into account when the direction is chosen, not ignored.

@kshnurov: Absolutely, but I don't think they are saying "we won't even review this".

Yes, maybe make a new node module, and make a PR on fuzzy-finder to add an option to use the new module. Then we can try it out for a while. I say we just do fuzzy-finder for now, and try it for a week or two.

^ Seems to me that they are open to the PR, but not ready to skim and merge without thorough review.

This PR is a behemoth. My 1st beef with Atom is the fuzzy finder. BUT I also want any solution(s) to be well tested and properly implemented, otherwise we're just layering bandaids on top of bandaids. Be a bit more positive and trust the process :)

@benogle
Copy link

benogle commented Sep 23, 2015

but I don't think they are saying "we won't even review this".

Definitely not. I think getting it in as an option is the first step to changing this.

@jeancroy
Copy link
Author

@benogle
I'm a bit stuck at the step where fuzzy finder don't use fuzzaldrin for filtering/scoring
Instead it rely on atom-space-pen-views which has non optional call to the current version.

https://github.com/atom/fuzzy-finder/blob/master/lib/fuzzy-finder-view.coffee#L135
https://github.com/atom/atom-space-pen-views/blob/master/src/select-list-view.coffee#L161

Should I bypass the super ?

@benogle
Copy link

benogle commented Sep 23, 2015

Yeah, I suppose you will need to override populateList

@jeancroy
Copy link
Author

As requested there is now a PR in fuzzy-finder.

@benogle
Copy link

benogle commented Sep 24, 2015

This has been added to atom/master via atom/fuzzy-finder#142. If you build atom from source, it is available via option:

Screen

@benogle
Copy link

benogle commented Sep 24, 2015

Closing this as we can continue the discussion over at https://github.com/jeancroy/fuzzaldrin-plus

@benogle benogle closed this Sep 24, 2015
@jesseleite
Copy link

Thank you <3

@jeancroy
Copy link
Author

jeancroy commented Oct 6, 2015

For those who want to test it, fuzzaldrin-plus will be an option of autocomplete-plus 2.22

@jesseleite
Copy link

Awesome, look forward to it!

@AbdullahAli
Copy link

The alternate scoring option makes the search really slow and "jittery" 👎

@jeancroy
Copy link
Author

Hi @AbdullahAli , can you give an example of query and context ?
Is it fuzzy finder, how many file indexed ?

The benchmak show it'll run anywhere between 2x faster to 2x slower depending on query.
(And for most of the query, on most condition I expect to err on the side of faster.)

Are you using this from a official stable distribution of atom or using some hack needed before ?

@jesseleite
Copy link

I haven't found any performance issues with alternate scoring option. It's been great for me! @jeancroy how do we find out how many files are indexed?

@jeancroy
Copy link
Author

Very first run of fuzzy finder in a project. If you have many file it show a waiting message with number of file so far.
img

But to create such scenario I have to open atom in my document folder or something multi-project like that.

@jesseleite
Copy link

No way to tell index count after it's already indexed? Maybe via console command or something?

@AbdullahAli
Copy link

Hey @jeancroy I have ~36k files

screen shot 2015-11-18 at 16 37 31

This is what it looks like when the alternate scoring option is enabled:
https://vid.me/NBz0

This is what it looks like with it disabled:
https://vid.me/hQUp

In the second video you can see it is smoother and not "jittery" - although the video quality might not be the best, so sorry about that

@jeancroy
Copy link
Author

In both video it show "reindexing project" so that does not helps.
What you see as "jerky" is that you type faster than the time needed to compute a result.

I'll open an issue on fuzzy finder about the debounce delay. (Maybe making a search each 2 char will help reduce pressure)

Btw what kind of system do you have ? There's a lot of ruby user that use this and the difference should not be that large.

@AbdullahAli
Copy link

It is the same effect after the "reindexing project" message goes away.

I am using a macbook pro on yosemite
Uploading Screen Shot 2015-11-18 at 17.19.46.png…

@jeancroy
Copy link
Author

Ok so a recent build? Are the file-name confidential ?
Would you mind sharing the file listing ? Maybe to a gist.

You can save it to a file like so:

  find /Path/To/Project -type f > FilesWithPaths.txt

I'm ok with you redacting out name of company , software, and whatever it need to be comfortable.
Ideally replacing by another string of the same length.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet