New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add native findWordsWithSubsequence method that can be used to optimize autocomplete #22

Merged
merged 22 commits into from Sep 28, 2017

Conversation

Projects
None yet
5 participants
@nathansobo
Contributor

nathansobo commented Aug 7, 2017

This PR is intended to enable optimization of our standard symbol-based providers in autocomplete-plus. These providers currently perform expensive up-front indexing of all the tokens in all open buffers after tokenization is completed. During actual auto-completion, the providers then perform a full subsequence match and scoring routine against all tokens on the main thread, which can visibly block the UI for buffers with large numbers of tokens.

This PR introduces a new low-level API on which we can build a better alternative. It does the heavy lifting of subsequence matching in C++ on a background thread.

findWordsWithSubsequence(query, extraWordCharacters, maxCount)

  • query A string containing a subsequence to match against.
  • extraWordCharacters Any extra non-alphanumeric characters that should be considered to be part of words, such as _, $, or -. These should depend on the language.
  • maxCount The maximum number of subsequence matches to return.

Returns an array of subsequence matches, each with the following properties:

  • word: The word that matched the subsequence.
  • positions: An array of {row, column} objects where this word occurs in the buffer.
  • matchIndices: An array of integers indicating the positions in the word where characters from the subsequence query matched.
  • score An integer reflecting how well this subsequence matched based on heuristics derived from this blog post.

Example

// assuming this code runs in an async function...

const buffer = new TextBuffer(
  'banana bandana ban_ana bandaid band bNa\nbanana'
)
const result = await buffer.findWordsWithSubsequence('bna', '_', 4)
assert.deepEqual(result, [
  {
    score: 30,
    matchIndices: [0, 1, 2 ],
    positions: [{row: 0, column: 36}],
    word: "bNa"
  },
  {
    score: 16,
    matchIndices: [0, 2, 4],
    positions: [{row: 0, column: 15}],
    word: "ban_ana"
  },
  {
    score: 12,
    matchIndices: [0, 2, 3],
    positions: [{row: 0, column: 0}, {row: 1, column: 0}],
    word: "banana"
  },
  {
    score: 10,
    matchIndices: [0, 5, 6],
    positions: [{row: 0, column: 7}],
    word: "bandana"
  }
])

How this can be used in autocomplete-plus

Instead of indexing all the symbols up front based on their syntactic scope, we can instead find the syntactic scope on an as needed basis. Whenever we need to autocomplete, we call this method on all open buffers. It will report the top N matches. Ideally, we then query the current syntax tree / tokenized buffer for the syntactic scope of the top N matches on the fly based on their location(s).

I'm putting this out there for now in case someone wants to pick up using this is autocomplete-plus, but don't have immediate plans of pushing this forward.

Remaining tasks

  • Core API
  • Node bindings
  • Asm.js bindings

/cc @maxbrunsfeld @wvanlint @iolsen

@nathansobo nathansobo requested a review from maxbrunsfeld Aug 7, 2017

@maxbrunsfeld

This looks super nice. Can't wait to use it.

vector<SubsequenceMatch> matches;
for (auto entry : substring_matches) {

This comment has been minimized.

@leroix

leroix Sep 12, 2017

Contributor

I had to bust out a pen and paper to wrap my head around this algorithm, but it seems to work. You may want to think about this as some function that takes a word and a query as input and returns a score and match indices as output. Then, any number of scoring functions with the same type signature can be plugged-in.

Word -> Query -> ([MatchIndex], Score)

@leroix

leroix Sep 12, 2017

Contributor

I had to bust out a pen and paper to wrap my head around this algorithm, but it seems to work. You may want to think about this as some function that takes a word and a query as input and returns a score and match indices as output. Then, any number of scoring functions with the same type signature can be plugged-in.

Word -> Query -> ([MatchIndex], Score)

This comment has been minimized.

@nathansobo

nathansobo Sep 12, 2017

Contributor

Presumably you get the part where we explore every possible permutation of the query matching against the word. We could definitely extract a function if you'd like. In general our style has drifted toward waiting for a second use before extracting logic. While extracting can aid clarity locally, it also reduces clarity globally because now you have to reason about whether the extracted function has multiple callers.

@nathansobo

nathansobo Sep 12, 2017

Contributor

Presumably you get the part where we explore every possible permutation of the query matching against the word. We could definitely extract a function if you'd like. In general our style has drifted toward waiting for a second use before extracting logic. While extracting can aid clarity locally, it also reduces clarity globally because now you have to reason about whether the extracted function has multiple callers.

Show outdated Hide outdated src/core/text-buffer.cc Outdated
if (i == 0 ||
!std::iswalnum(word[i - 1]) ||
(std::iswlower(word[i - 1]) && std::iswupper(word[i]))) {

This comment has been minimized.

@leroix

leroix Sep 12, 2017

Contributor

It seems like if not for this rule, there could be a function that takes any list of match indices and computes the score, but with this rule, I guess you also need to know whether each index corresponds to an upper or lower-case letter.

@leroix

leroix Sep 12, 2017

Contributor

It seems like if not for this rule, there could be a function that takes any list of match indices and computes the score, but with this rule, I guess you also need to know whether each index corresponds to an upper or lower-case letter.

@leroix leroix referenced this pull request Sep 13, 2017

Merged

add subsequence provider #886

4 of 4 tasks complete
@leroix

This comment has been minimized.

Show comment
Hide comment
@leroix

leroix Sep 14, 2017

Contributor

I'm experiencing some crashes when using this in atom/autocomplete-plus#886 in dev mode. I'll report back when I have more information.

Contributor

leroix commented Sep 14, 2017

I'm experiencing some crashes when using this in atom/autocomplete-plus#886 in dev mode. I'll report back when I have more information.

@leroix

This comment has been minimized.

Show comment
Hide comment
@leroix

leroix Sep 14, 2017

Contributor

I have a crash log to share. The assertion in the snapshot destructor seems to be failing. This occurred while running the autocomplete-plus package specs focused on just the subsequence provider specs.

https://gist.github.com/leroix/ceebd504263164de954908db5b7e2df0

Contributor

leroix commented Sep 14, 2017

I have a crash log to share. The assertion in the snapshot destructor seems to be failing. This occurred while running the autocomplete-plus package specs focused on just the subsequence provider specs.

https://gist.github.com/leroix/ceebd504263164de954908db5b7e2df0

@leroix

This comment has been minimized.

Show comment
Hide comment
@leroix

leroix Sep 15, 2017

Contributor

I'm taking a stab at debugging this. It's definitely reproducible by using findWordsWithSubsequence in a 1000x loop like you mentioned @maxbrunsfeld. It seems that we have a dangling pointer when the Snapshot destructor gets called. TextBuffer::consolidate_layers runs and top_layer->previous_layer points to a layer that can't be read. I'm going to keep investigating, but let me know if you have any hunches.

Contributor

leroix commented Sep 15, 2017

I'm taking a stab at debugging this. It's definitely reproducible by using findWordsWithSubsequence in a 1000x loop like you mentioned @maxbrunsfeld. It seems that we have a dangling pointer when the Snapshot destructor gets called. TextBuffer::consolidate_layers runs and top_layer->previous_layer points to a layer that can't be read. I'm going to keep investigating, but let me know if you have any hunches.

fix intermittent `findWordsWithSubsequence` crashes
I whittled down the `find` method and `find_words_with_subsequence` method
until they were practically the same, and `find_words_with_subsequence` was
still crashing while `find` was not. Then, I started looking at their
use and `index.js`, and the only thing I could see that was definitely
different was that `findWordsWithSubsequence` was passing the promise's
`resolve` function as the callback to the `findWordsWithSubsequence`
binding. I changed it to pass an arrow function that calls `resolve`, and the test
passed.

I would really love to know why this fix works if anyone has any
insight.
@leroix

This comment has been minimized.

Show comment
Hide comment
@leroix

leroix Sep 17, 2017

Contributor

I found a fix #31, but I'm not sure why it works.

Contributor

leroix commented Sep 17, 2017

I found a fix #31, but I'm not sure why it works.

@leroix

This comment has been minimized.

Show comment
Hide comment
@leroix

leroix Sep 21, 2017

Contributor

Thoughts on merging #31 and #32 into this branch and merging this branch?

Contributor

leroix commented Sep 21, 2017

Thoughts on merging #31 and #32 into this branch and merging this branch?

@maxbrunsfeld

This comment has been minimized.

Show comment
Hide comment
@maxbrunsfeld

maxbrunsfeld Sep 21, 2017

Contributor

@leroix Go for it 🚢 .

Contributor

maxbrunsfeld commented Sep 21, 2017

@leroix Go for it 🚢 .

@leroix

This comment has been minimized.

Show comment
Hide comment
@leroix

leroix Sep 21, 2017

Contributor

I don't have the power :)

Contributor

leroix commented Sep 21, 2017

I don't have the power :)

@leroix

This comment has been minimized.

Show comment
Hide comment
@leroix

leroix Sep 21, 2017

Contributor

I'd say first merge the two fixes and make sure CI passes--it should.

Contributor

leroix commented Sep 21, 2017

I'd say first merge the two fixes and make sure CI passes--it should.

Ben3eeE added some commits Sep 21, 2017

Merge pull request #31 from leroix/ns-mb-substring-match-crash-fix
fix intermittent `findWordsWithSubsequence` crashes
Merge pull request #32 from leroix/ns-mb-substring-match-fix-browser-…
…build

add support for `findWordsWithSubsequence` in the browser build
@Ben3eeE

This comment has been minimized.

Show comment
Hide comment
@Ben3eeE

Ben3eeE Sep 21, 2017

Member

@leroix I merged #31 and #32

Member

Ben3eeE commented Sep 21, 2017

@leroix I merged #31 and #32

@iolsen

This comment has been minimized.

Show comment
Hide comment
@iolsen

iolsen Sep 21, 2017

I don't have the power :)

✔️ Fixed!

iolsen commented Sep 21, 2017

I don't have the power :)

✔️ Fixed!

maxbrunsfeld added some commits Sep 21, 2017

Avoid writing to invalid reference in find_words_with_subsequence
Signed-off-by: Justin Ratner <leroix08@gmail.com>
Maintain explicit reference to buffer in FindWordsWithSubsequence worker
Signed-off-by: Justin Ratner <leroix08@gmail.com>
Fix subsequence match score in native test
Signed-off-by: Justin Ratner <leroix08@gmail.com>
@leroix

This comment has been minimized.

Show comment
Hide comment
@leroix

leroix Sep 21, 2017

Contributor

After talking to @maxbrunsfeld , I think we're going to change the native function to find_words_with_subsequence_in_range and expose a findWordsWithSubsequenceInRange and findWordsWithSubsequence (just searches the whole range) in node. That will help us deal with pathological cases in autocomplete-plus where people have very large files open. I'm also going to look at how find_words_with_subsequence performs on very large buffers.

Contributor

leroix commented Sep 21, 2017

After talking to @maxbrunsfeld , I think we're going to change the native function to find_words_with_subsequence_in_range and expose a findWordsWithSubsequenceInRange and findWordsWithSubsequence (just searches the whole range) in node. That will help us deal with pathological cases in autocomplete-plus where people have very large files open. I'm also going to look at how find_words_with_subsequence performs on very large buffers.

@leroix

This comment has been minimized.

Show comment
Hide comment
@leroix

leroix Sep 22, 2017

Contributor

Here's some data related to my previous comment. This pulls down a 51MB text file, uses it to construct a buffer and times how long findWordsWithSubsequence takes.

~/github/superstring/benchmark (ns-mb-substring-match *)$ node large-text-buffer.benchmark.js 
fetching text file...
running findWordsWithSubsequence...
Time to find "cat" in 51MB file: 856.712ms

This would potentially block displaying suggestions making autocomplete feel really laggy. Fortunately, it wouldn't block the main thread.

Contributor

leroix commented Sep 22, 2017

Here's some data related to my previous comment. This pulls down a 51MB text file, uses it to construct a buffer and times how long findWordsWithSubsequence takes.

~/github/superstring/benchmark (ns-mb-substring-match *)$ node large-text-buffer.benchmark.js 
fetching text file...
running findWordsWithSubsequence...
Time to find "cat" in 51MB file: 856.712ms

This would potentially block displaying suggestions making autocomplete feel really laggy. Fortunately, it wouldn't block the main thread.

@maxbrunsfeld maxbrunsfeld merged commit 8f8594e into master Sep 28, 2017

0 of 2 checks passed

ci/circleci CircleCI is running your tests
Details
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details

@maxbrunsfeld maxbrunsfeld deleted the ns-mb-substring-match branch Sep 28, 2017

@maxbrunsfeld

This comment has been minimized.

Show comment
Hide comment
@maxbrunsfeld

maxbrunsfeld Sep 28, 2017

Contributor

⚡️ Thanks for getting this over the finish line @leroix

Contributor

maxbrunsfeld commented Sep 28, 2017

⚡️ Thanks for getting this over the finish line @leroix

@iolsen

This comment has been minimized.

Show comment
Hide comment
@iolsen

iolsen commented Sep 28, 2017

Nice! 💯 @leroix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment