Too slow on large real world codebases #2668

Closed
oblitum opened this Issue May 30, 2017 · 22 comments

oblitum (Contributor) commented May 30, 2017

Consider the following trivial C++ program:

Case 1

#include <iostream>
#include <boost/bimap.hpp>

struct S { int x; };

int main() {
    using namespace std;
    using namespace boost;

    S s;
    s.x;
}

Automatic semantic completion just after the member access operator works, but forced semantic completion just before it does not. This happens because YCM is unable to handle large numbers of identifiers.

Forced semantic completion before the . makes libclang return global-scope completion data. libclang is very fast at doing so, but YCM chokes. One of two things happens:

  • It will always timeout [fixed by #2657]
  • Results come up, but are very slow

Things that always happen:

  • High CPU consumption
  • Typing lag (Vim hangs) [fixed by #2657]

Completion at global scope is a very important use case, especially in C codebases where nearly everything lives at the global scope. For example:

Case 2

#include <windows.h>

struct S { int x; };

int main() {
    S s;
    s.x;
}

Wanting completion for the Windows API, which consists entirely of C functions, is a very common use case. Here the most useful and expected thing YCM could offer is completion of the API provided by the included header, yet it simply can't handle it, despite how ubiquitous windows.h is.

Program 1 below was used to verify bare libclang timings for these use cases (with LIBCLANG_TIMING=1, they can be compared against libclang's internal timings).

For case 1 after . (#define flags linux_flags):

❯❯❯ ./libclang-timing case1.cpp 11 7
>>> First run: 5 results in 4ms
>>> Second run: 5 results in 3ms

>>> AFTER REPARSE

>>> Third run: 5 results in 3ms
>>> Fourth run: 5 results in 3ms

For case 1 before . (#define flags linux_flags):

❯❯❯ ./libclang-timing case1.cpp 11 6
>>> First run: 12989 results in 57ms
>>> Second run: 12989 results in 59ms

>>> AFTER REPARSE

>>> Third run: 13123 results in 39ms
>>> Fourth run: 13123 results in 37ms

For case 2 after . (#define flags windows_flags):

❯❯❯ ./libclang-timing case2.cpp 7 7
>>> First run: 5 results in 1ms
>>> Second run: 5 results in 1ms

>>> AFTER REPARSE

>>> Third run: 5 results in 1ms
>>> Fourth run: 5 results in 1ms

For case 2 before . (#define flags windows_flags):

❯❯❯ ./libclang-timing case2.cpp 7 6
>>> First run: 43304 results in 106ms
>>> Second run: 43304 results in 99ms

>>> AFTER REPARSE

>>> Third run: 43340 results in 3ms
>>> Fourth run: 43340 results in 3ms

Notice that reparsing has been included because my experience apparently conflicts with this comment:

  // codeCompleteAt reparses the TU if the underlying source file has changed on
  // disk since the last time the TU was updated and there are no unsaved files.
  // If there are unsaved files, then codeCompleteAt will parse the in-memory
  // file contents we are giving it. In short, it is NEVER a good idea to call
  // clang_reparseTranslationUnit right before a call to clang_codeCompleteAt.
  // This only makes clang reparse the whole file TWICE, which has a huge impact
  // on latency. At the time of writing, it seems that most users of libclang
  // in the open-source world don't realize this (I checked). Some don't even
  // call reparse*, but parse* which is even less efficient.

Despite that, ~100ms is reasonably unnoticeable (though alarming compared to ~3ms), so it's not libclang's fault. The windows.h use case can be verified from Linux by following the tips in this blog post.

It's at around ~35k identifiers that YCM starts to choke badly.

I've experienced YCM slowness on huge files before, and I generally open such files with vim -u NONE to avoid it, even files that get no semantic completion at all, just identifier completion, like some .sql files.

With this information in mind, I've created Program 2 to generate output with 60k identifiers. Save this output to identifiers.sql, open it with YCM, and try to edit at the end of the file: the same problems occur without any semantic completion involved:

  • If aaa is typed, YCM will simply time out [fixed by #2657]
  • If aaaaaa is typed, YCM may be able to complete, but it'll be slow.
  • If backspace is used to delete characters, Vim hangs. Editing in general is slow [fixed by #2657]. CPU consumption is high.

Video demonstration:

System information:

  • Intel® Core™ i7-7700K Processor
  • DDR4 Corsair Vengeance LPX 16GB 2400MHz RAM
  • ArchLinux Operating System
  • gcc 7.1.1
  • libc++ 4.0.0
  • Boost 1.64
  • Enterprise WDK 1703
  • NeoVim 0.2.0

Related issue:


Program 1

#include <chrono>
#include <cstdlib>
#include <iostream>

#include <clang-c/Index.h>

// Compile with:
// g++ -O3 libclang-timing.cpp -o libclang-timing -lclang
// Run with:
// LIBCLANG_TIMING=1 ./libclang-timing file.cpp line column

const char *linux_flags[] = {
    "-x",
    "c++",
    "-pedantic",
    "-std=c++1z",
    "-stdlib=libc++",
    "-isystem",
    "/usr/include/c++/v1",
    "-Wall",
    "-Wextra",
    "-isystem",
    "/usr/local/include",
    "-isystem",
    "/usr/lib/clang/4.0.0/include",
    "-isystem",
    "/usr/include",
};

const char *windows_flags[] = {
    "-x",
    "c++",
    "--target=i386-pc-windows-msvc",
    "-ferror-limit=64",
    "-fms-compatibility-version=19",
    "-Wall",
    "-Wextra",
    "-Wno-unknown-pragmas",
    "-U__clang__",
    "-U__clang_version__",
    "-U__clang_major__",
    "-U__clang_minor__",
    "-U__clang_patchlevel__",
    "-DWIN32",
    "-D_WINDOWS",
    "-DNDEBUG",
    "-D_MT",
    "-D_X86_=1",
    "-DNOMINMAX",
    "-D_WIN32_WINNT=0x0501",
    "-D_CRT_SECURE_NO_WARNINGS=1",
    "-nostdinc",
    "-isystem",
    "/mnt/wdk/1703/program files/windows kits/10/include/10.0.15063.0/shared",
    "-isystem",
    "/mnt/wdk/1703/program files/windows kits/10/include/10.0.15063.0/ucrt",
    "-isystem",
    "/mnt/wdk/1703/program files/windows kits/10/include/10.0.15063.0/um",
    "-isystem",
    "/mnt/wdk/1703/program files/microsoft visual studio 14.0/vc/include",
};

#define flags linux_flags

int main(int argc, const char *argv[]) {
    using std::cout;
    using std::cerr;
    using std::endl;
    using std::chrono::milliseconds;
    using std::chrono::duration_cast;
    using std::chrono::high_resolution_clock;

    if (argc != 4) {
        cout << argv[0] << " file.cpp line column" << endl;
        return EXIT_FAILURE;
    }

    CXIndex index = clang_createIndex(0, 0);
    if (!index) {
        cerr << "createIndex failed" << endl;
        return EXIT_FAILURE;
    }

    CXTranslationUnit tu;
    CXErrorCode result = clang_parseTranslationUnit2FullArgv(
        index, argv[1], flags, sizeof(flags) / sizeof(flags[0]), nullptr, 0,
        CXTranslationUnit_DetailedPreprocessingRecord |
            CXTranslationUnit_Incomplete |
            CXTranslationUnit_IncludeBriefCommentsInCodeCompletion |
            CXTranslationUnit_CreatePreambleOnFirstParse |
            CXTranslationUnit_KeepGoing |
            clang_defaultEditingTranslationUnitOptions(),
        &tu);

    if (result != CXError_Success) {
        cerr << "clang_parseTranslationUnit2FullArgv failed" << endl;
        return EXIT_FAILURE;
    }

    const unsigned int line   = strtoul(argv[2], 0, 10);
    const unsigned int column = strtoul(argv[3], 0, 10);

    auto start = high_resolution_clock::now();
    CXCodeCompleteResults *completion_results =
        clang_codeCompleteAt(tu, argv[1], line, column, nullptr, 0,
                             clang_defaultCodeCompleteOptions() |
                                 CXCodeComplete_IncludeBriefComments);
    auto elapsed = high_resolution_clock::now() - start;

    if (!completion_results) {
        cerr << "Could not complete" << endl;
        return EXIT_FAILURE;
    }

    cout << ">>> First run: " << completion_results->NumResults
         << " results in " << duration_cast<milliseconds>(elapsed).count()
         << "ms" << endl;

    clang_disposeCodeCompleteResults(completion_results);

    start = high_resolution_clock::now();
    completion_results =
        clang_codeCompleteAt(tu, argv[1], line, column, nullptr, 0,
                             clang_defaultCodeCompleteOptions() |
                                 CXCodeComplete_IncludeBriefComments);
    elapsed = high_resolution_clock::now() - start;

    if (!completion_results) {
        cerr << "Could not complete" << endl;
        return EXIT_FAILURE;
    }

    cout << ">>> Second run: " << completion_results->NumResults
         << " results in " << duration_cast<milliseconds>(elapsed).count()
         << "ms" << endl;

    clang_disposeCodeCompleteResults(completion_results);

    clang_reparseTranslationUnit(tu, 0, nullptr,
                                 clang_defaultReparseOptions(tu));

    cout << "\n>>> AFTER REPARSE\n" << endl;

    start = high_resolution_clock::now();
    completion_results =
        clang_codeCompleteAt(tu, argv[1], line, column, nullptr, 0,
                             clang_defaultCodeCompleteOptions() |
                                 CXCodeComplete_IncludeBriefComments);
    elapsed = high_resolution_clock::now() - start;

    if (!completion_results) {
        cerr << "Could not complete" << endl;
        return EXIT_FAILURE;
    }

    cout << ">>> Third run: " << completion_results->NumResults
         << " results in " << duration_cast<milliseconds>(elapsed).count()
         << "ms" << endl;

    clang_disposeCodeCompleteResults(completion_results);

    start = high_resolution_clock::now();
    completion_results =
        clang_codeCompleteAt(tu, argv[1], line, column, nullptr, 0,
                             clang_defaultCodeCompleteOptions() |
                                 CXCodeComplete_IncludeBriefComments);
    elapsed = high_resolution_clock::now() - start;

    if (!completion_results) {
        cerr << "Could not complete" << endl;
        return EXIT_FAILURE;
    }

    cout << ">>> Fourth run: " << completion_results->NumResults
         << " results in " << duration_cast<milliseconds>(elapsed).count()
         << "ms" << endl;

    for (unsigned int i = 0;
         i < clang_codeCompleteGetNumDiagnostics(completion_results); i++) {
        const CXDiagnostic &diagnostic =
            clang_codeCompleteGetDiagnostic(completion_results, i);
        const CXString &s = clang_getDiagnosticSpelling(diagnostic);
        cout << clang_getCString(s) << endl;
    }

    clang_disposeCodeCompleteResults(completion_results);
}

Program 2

#include <cstdio>
#include <algorithm>

int main() {
    using namespace std;

    char identifier[6] = {};

    int count = 0;
    for (int i = 0; i < 6250; ++i) {
        for (int j = 0; j < 8; ++j) {
            sprintf(identifier, "%05d", count);
            for (char &c : identifier)
                c = c ? c - '0' + 'a' : c;
            printf("a_a_a_%s ", identifier);
            ++count;
        }
        printf("\n");
    }
}
bstaletic (Collaborator) commented May 30, 2017

I gave this a very quick test with my YCM including changes from #2657. It works much faster than before those changes, so could you give that PR a try? That PR implements asynchronous semantic completion.

oblitum (Contributor) commented May 30, 2017

Hi @bstaletic. Did you test case 1, case 2, or the output of Program 2?

bstaletic (Collaborator) commented May 30, 2017

I tried it on all three of those, but I couldn't get Program 1 to work as I don't have Windows installed at all.

oblitum (Contributor) commented May 30, 2017

I haven't tried #2657 yet, but I did check it before opening this issue. I'll try it, but given the scope of that pull request, I suspected it wouldn't touch possible existing issues in the request/display process as a whole.

Notice that Program 1 was compiled and tested on Linux; I didn't use Windows for anything. Also notice that some of its flags may require changes for your machine, like header locations, etc. They're based on my .ycm_extra_conf.py flags.

bstaletic (Collaborator) commented May 30, 2017

Actually, my bad. I have not used anything that contains <windows.h>, which means I tested Case 1 and Program 2.

As for Program 1, I managed to get it to work. Completions are not instantaneous, but they are far from slow.

oblitum (Contributor) commented May 30, 2017

@bstaletic my bad, I've updated the issue. I had posted the wrong case 1 snippet; it was meant to be based on #777 (comment), but I ended up removing the using namespace directives. I'm unsure whether it will make any difference for you; on my side it just gets a bit slower with original YCM. The windows.h case is more problematic.

bstaletic (Collaborator) commented May 30, 2017

Still, I see almost no slowdown with the using namespace case. I think you should try #2657 and report back. The PR won't be merged until the end of the week.

oblitum (Contributor) commented May 30, 2017

@bstaletic I've tried #2657. Overall completion is much improved, kudos to @micbou! Still, it behaves as I was expecting.

What it does:

  • No more timeouts, great!
  • Stops making Vim hang while typing!

What it doesn't do:

  • Completion arrives, but late. The slowness in the "request/display process as a whole" isn't solved, which is understandable since this issue isn't a target of that pull request; it's just masked by the new async menu refreshing mechanism.
  • Lower CPU consumption.
bstaletic (Collaborator) commented May 30, 2017

@oblitum Your observations seem reasonable, but, on my laptop, I just can't agree with "Completion arrives, but late". The initial parse and completions took about two seconds, and to make it even that slow I had to use echo 3 | tee /proc/sys/vm/drop_caches. In any other case the completions were almost instant.

That said, I do believe it can still feel slow for you (and others). But frankly, I have no idea where I would start looking to find the source of the slowness.

As for being resource expensive, Program 1 took ~70MiB at most, and I didn't notice any CPU hogging. Those results are right after dropping caches.

oblitum (Contributor) commented May 30, 2017

@bstaletic Here is a video demonstrating it: https://vimeo.com/219545420. I've updated Program 2 to 60k identifiers (doubling the original) to make the effect clearly visible. In case 2, the bare inclusion of windows.h generates around 43k identifiers, and the effect can still be felt.

It's also worth noting that #2657 introduces a new semantic completion behavior: after forced semantic completion, if I backspace to erase some characters of the current incomplete identifier, I'm forced to trigger semantic completion again to get the original semantic identifier ordering. I don't know the full extent of the effects of this, but some people may not like it.

I'm on a 7700K PC.

Sorry about the second part of the video for case 2; you'll be unable to tell when I hit ctrl-space, as I have no keyboard screencast software at the moment. I'll think about updating it later.

bstaletic (Collaborator) commented May 30, 2017

I knew there was something wrong when I read about your CPU. I'm on an i7-3610QM, so I definitely have less raw power than you. What was wrong about my results was that I was looking at completions in the code that generates the large number of identifiers, not in its output.

Unlike you, I'm still hitting a timeout unless I type five characters in the generated file.

As for case 1, I can't actually say it's too fast, but considering how big Boost's bimap is, I'm not too surprised.

oblitum (Contributor) commented May 30, 2017

Ah OK, you were testing the program, not the output. I hope others don't fall into the same trap :)

bstaletic (Collaborator) commented May 30, 2017

I noticed one more thing.

  It's also worth noticing that #2657 introduces a new semantic completion behavior, after forced semantic completion, if I backspace to erase some chars of the current incomplete identifier, I'm forced to force semantic completion again to get the original semantic identifier ordering. I ignore the extension of the effects of this but some people may not like.

This doesn't happen with a low number of identifiers. Also, I had to hit <C-Space> after reinserting, not after deleting.

@micbou micbou referenced this issue in Valloric/ycmd Jun 2, 2017

Merged

[READY] Add benchmark infrastructure #769

zzbot added a commit to Valloric/ycmd that referenced this issue Jun 5, 2017

Auto merge of #769 - micbou:google-benchmark, r=bstaletic
[READY] Add benchmark infrastructure

This PR sets the infrastructure for adding benchmarks through [the Google benchmark library](https://github.com/google/benchmark) and for automatically running them on Travis and AppVeyor. They can also be run locally with the `benchmark.py` script. The library is included in the repository for compilation ease. Benchmarks are run on all platforms because optimizations may be platform-dependent.

For now, there is only one benchmark based on the output of *Program 2* in Valloric/YouCompleteMe#2668. It measures the filter and sort algorithm on a worst-case scenario: all identifiers share a common prefix and the query is part of the prefix. In that case, no identifiers are filtered and since they all have the same weight, the algorithm falls back to lexicographic sorting. This scenario is not uncommon in practice. For instance, C libraries often use a common prefix for naming variables and functions to simulate namespaces.

Here's the output of the benchmark on my configuration:
```
------------------------------------------------------------------------------------------
Benchmark                                       Time           CPU Iterations
------------------------------------------------------------------------------------------
CandidatesWithCommonPrefix_bench/1           1955 ns       1898 ns     345165
CandidatesWithCommonPrefix_bench/2          11563 ns      11681 ns      64102
CandidatesWithCommonPrefix_bench/4          30761 ns      30594 ns      22436
CandidatesWithCommonPrefix_bench/8          69551 ns      69532 ns      11218
CandidatesWithCommonPrefix_bench/16        143963 ns     143924 ns       4986
CandidatesWithCommonPrefix_bench/32        292668 ns     290603 ns       2362
CandidatesWithCommonPrefix_bench/64        862766 ns     869571 ns        897
CandidatesWithCommonPrefix_bench/128      2205099 ns    2191318 ns        299
CandidatesWithCommonPrefix_bench/256      8895499 ns    8840057 ns         90
CandidatesWithCommonPrefix_bench/512     17704787 ns   17680113 ns         45
CandidatesWithCommonPrefix_bench/1024    45564517 ns   45760293 ns         15
CandidatesWithCommonPrefix_bench/2048    96960893 ns   98057771 ns          7
CandidatesWithCommonPrefix_bench/4096   217881085 ns  218401400 ns          3
CandidatesWithCommonPrefix_bench/8192   481444392 ns  483603100 ns          2
CandidatesWithCommonPrefix_bench/16384 1005462405 ns  982806300 ns          1
CandidatesWithCommonPrefix_bench/32768 1805209871 ns 1809611600 ns          1
CandidatesWithCommonPrefix_bench/65536 4215533125 ns 4212027000 ns          1
CandidatesWithCommonPrefix_bench_BigO     3979.06 NlgN    3974.50 NlgN
CandidatesWithCommonPrefix_bench_RMS           10 %          9 %
```
As you can see, performance becomes unacceptable starting from 16000 identifiers which is not a lot. A great feature of Google benchmark is that it can calculate the algorithm complexity. As expected, we have a `O(n log n)` complexity where `n` is the number of candidates (we are using `std::sort` to sort our candidates).

Thanks to this benchmark, I was able to improve the performance on this particular case by a factor of 60. I'll send the changes once this PR is merged.

micbou (Collaborator) commented Jun 9, 2017

PR Valloric/ycmd#774 should improve responsiveness and reduce CPU usage in Case 1 and Case 2.

oblitum (Contributor) commented Jun 9, 2017

  PR Valloric/ycmd#774 should improve responsiveness and reduce CPU usage in Case 1 and Case 2.

Confirmed! It's much improved.

oblitum (Contributor) commented Jun 9, 2017

@micbou IMO, it's so much more usable now, thanks. Your changes improved the semantic results a lot, but I also have other changes applied: #2657 and this local one:

diff --git a/cpp/ycm/IdentifierDatabase.cpp b/cpp/ycm/IdentifierDatabase.cpp
index 4a44bc7b..5e4e349a 100644
--- a/cpp/ycm/IdentifierDatabase.cpp
+++ b/cpp/ycm/IdentifierDatabase.cpp
@@ -111,7 +111,10 @@ void IdentifierDatabase::ResultsForQueryAndType(
     }
   }

-  std::sort( results.begin(), results.end() );
+  if (results.size() < 50)
+    std::sort( results.begin(), results.end() );
+  else
+    std::partial_sort( results.begin(), results.begin() + 50, results.end() );
 }


diff --git a/cpp/ycm/PythonSupport.cpp b/cpp/ycm/PythonSupport.cpp
index 3f2d3f1d..0da9334d 100644
--- a/cpp/ycm/PythonSupport.cpp
+++ b/cpp/ycm/PythonSupport.cpp
@@ -101,11 +101,16 @@ boost::python::list FilterAndSortCandidates(
       }
     }

-    std::sort( result_and_objects.begin(), result_and_objects.end() );
+    if (result_and_objects.size() < 50)
+      std::sort( result_and_objects.begin(), result_and_objects.end() );
+    else
+      std::partial_sort( result_and_objects.begin(),
+                         result_and_objects.begin() + 50,
+                         result_and_objects.end() );
   }

-  for ( const ResultAnd< int > &result_and_object : result_and_objects ) {
-    filtered_candidates.append( candidates[ result_and_object.extra_object_ ] );
+  for ( size_t i = 0; i < result_and_objects.size() && i < 50; ++i ) {
+    filtered_candidates.append( candidates[ result_and_objects[i].extra_object_ ] );
oblitum (Contributor) commented Jun 9, 2017

@micbou I noticed a large improvement on the Program 2 output case from the previous patch alone, even before applying Valloric/ycmd#774.

zzbot added a commit to Valloric/ycmd that referenced this issue Aug 25, 2017

Auto merge of #818 - micbou:filter-and-sort-candidates-benchmark, r=bstaletic

[READY] Add FilterAndSortCandidates benchmarks

[The `CandidatesForQuery` method](https://github.com/Valloric/ycmd/blob/2575bdb8c83dd5ace3d6f3d72a3425fdf2c18f5e/cpp/ycm/IdentifierCompleter.h#L69) is not the only place where performance is critical. There is also [the `FilterAndSortCandidates` function](https://github.com/Valloric/ycmd/blob/2575bdb8c83dd5ace3d6f3d72a3425fdf2c18f5e/cpp/ycm/PythonSupport.h#L29) which is used in the Python layer to filter and sort candidates returned by completers other than the identifier one (while `CandidatesForQuery` is specific to the identifier completer). We should add benchmarks for this function too.

It would be great for these benchmarks to be based on real usage (e.g. case 1 or 2 from Valloric/YouCompleteMe#2668). This could be done by storing the candidates in a file that would be read by the benchmarks to get the candidates. However, this kind of file would be too big (> 1MB) to be committed to the repository. We would have to download the file from somewhere else to run the benchmarks. I didn't want to bother so I went for the same candidates as the `CandidatesForQuery` benchmark. Those candidates represent a corner case but, as long as we don't specifically target them, they should be good enough.

Two benchmarks are added: one where the candidates are not already stored in the repository and another where they are. This is important because both situations arise in practice: the first time candidates are filtered and sorted, they are added to the repository (user pressing `<C-Space>`); on subsequent requests, they are retrieved from the repository (user typing characters to filter out candidates).

Here are the benchmark results on my configuration:
```
Run on (4 X 3504 MHz CPU s)
------------------------------------------------------------------------------------------------------------------
Benchmark                                                                           Time           CPU Iterations
------------------------------------------------------------------------------------------------------------------
IdentifierCompleterFixture/CandidatesWithCommonPrefix/1024                     512467 ns     514442 ns       1122
IdentifierCompleterFixture/CandidatesWithCommonPrefix/2048                    1144025 ns    1143845 ns        641
IdentifierCompleterFixture/CandidatesWithCommonPrefix/4096                    2643405 ns    2659108 ns        264
IdentifierCompleterFixture/CandidatesWithCommonPrefix/8192                    6263122 ns    6267897 ns        112
IdentifierCompleterFixture/CandidatesWithCommonPrefix/16384                  13002049 ns   12535795 ns         56
IdentifierCompleterFixture/CandidatesWithCommonPrefix/32768                  28741026 ns   28704184 ns         25
IdentifierCompleterFixture/CandidatesWithCommonPrefix/65536                  60231116 ns   60840390 ns         10
IdentifierCompleterFixture_BigO                                                 57.59 NlgN      57.96 NlgN
IdentifierCompleterFixture_RMS                                                      1 %          2 %
PythonSupportFixture/FilterAndSortUnstoredCandidatesWithCommonPrefix/1024     4280306 ns    4680030 ns        150
PythonSupportFixture/FilterAndSortUnstoredCandidatesWithCommonPrefix/2048     9249671 ns    9186726 ns         90
PythonSupportFixture/FilterAndSortUnstoredCandidatesWithCommonPrefix/4096    18832170 ns   18373451 ns         45
PythonSupportFixture/FilterAndSortUnstoredCandidatesWithCommonPrefix/8192    38904981 ns   37266906 ns         18
PythonSupportFixture/FilterAndSortUnstoredCandidatesWithCommonPrefix/16384   78318612 ns   78000500 ns          9
PythonSupportFixture/FilterAndSortUnstoredCandidatesWithCommonPrefix/32768  158404771 ns  163801050 ns          4
PythonSupportFixture/FilterAndSortUnstoredCandidatesWithCommonPrefix/65536  317453915 ns  319802050 ns          2
PythonSupportFixture_BigO                                                     4836.93 N    4891.16 N
PythonSupportFixture_RMS                                                            1 %          2 %
PythonSupportFixture/FilterAndSortStoredCandidatesWithCommonPrefix/1024        542299 ns     530403 ns       1000
PythonSupportFixture/FilterAndSortStoredCandidatesWithCommonPrefix/2048       1264767 ns    1279153 ns        561
PythonSupportFixture/FilterAndSortStoredCandidatesWithCommonPrefix/4096       2724379 ns    2718199 ns        264
PythonSupportFixture/FilterAndSortStoredCandidatesWithCommonPrefix/8192       6263308 ns    6267897 ns        112
PythonSupportFixture/FilterAndSortStoredCandidatesWithCommonPrefix/16384     12994143 ns   12792082 ns         50
PythonSupportFixture/FilterAndSortStoredCandidatesWithCommonPrefix/32768     27466364 ns   27300175 ns         28
PythonSupportFixture/FilterAndSortStoredCandidatesWithCommonPrefix/65536     57326742 ns   56727636 ns         11
PythonSupportFixture_BigO                                                       54.99 NlgN      54.45 NlgN
PythonSupportFixture_RMS                                                            2 %          2 %
```
We see that filtering and sorting is much slower when the candidates are not already stored, so this is something we need to consider when trying to improve performance.

@oblitum

Contributor
oblitum commented Sep 5, 2017

Folks, feel free to close this if it has reached a good enough state. I've been stuck on improvements beyond my previous patch: I couldn't move further on my fork due to conflicting changes with #2657 regarding parameter hints. As far as I've tested, up to my previous patch I was already satisfied enough, despite not leveraging the new async system.

@oblitum

Contributor

oblitum commented Sep 5, 2017

No time for looking into Vim completion at the moment at all.

@vheon

Collaborator

vheon commented Sep 5, 2017

@oblitum just to let you know we've implemented partial sorting in ycmd ;)

@oblitum

Contributor

oblitum commented Sep 5, 2017

@vheon that's great!

@micbou

Collaborator

micbou commented Mar 13, 2018

Completing at global scope is now almost instant in cases 1 and 2 so we can definitely close this.

@micbou micbou closed this Mar 13, 2018
