-
Notifications
You must be signed in to change notification settings - Fork 766
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[READY] Fix issues with multi-byte characters #455
Conversation
Will happily rebase this and squash the history before merging, but i thought all the history might help reviewers. This is marked as RFC mainly because it is quite an invasive change and quite a big one. I'm pretty confident that it resolves a lot of issues, but there is always the risk of a regression. |
Sweet Jesus, this is fantastic! :D Must have been a horrendous amount of work. Thanks so much! Haven't yet started reviewing this; I'll try to find the time tomorrow. |
I'm not even going to attempt to estimate the depths of suffering you must have reached while fixing all of this. with minor comments. WRT keeping the commits, I'm not at all against PRs having more than on commit, merely against having nonsensical commits in Reviewed 42 of 43 files at r1, 1 of 1 files at r2. ycmd/identifier_utils.py, line 54 [r2] (raw file): ycmd/identifier_utils.py, line 56 [r2] (raw file): ycmd/request_wrap.py, line 102 [r2] (raw file): ycmd/request_wrap.py, line 127 [r2] (raw file): ycmd/request_wrap.py, line 150 [r2] (raw file): ycmd/responses.py, line 64 [r2] (raw file): God damn you Python. God damn you. ycmd/responses.py, line 137 [r2] (raw file): ycmd/server_state.py, line 115 [r2] (raw file): ycmd/utils.py, line 134 [r2] (raw file): ycmd/utils.py, line 135 [r2] (raw file): ycmd/utils.py, line 137 [r2] (raw file): ycmd/utils.py, line 154 [r2] (raw file): ycmd/utils.py, line 165 [r2] (raw file): ycmd/completers/completer.py, line 43 [r2] (raw file): ycmd/completers/completer.py, line 78 [r2] (raw file): ycmd/completers/completer_utils.py, line 121 [r2] (raw file): ycmd/completers/completer_utils.py, line 175 [r2] (raw file): ycmd/completers/completer_utils.py, line 194 [r2] (raw file): This works too. ycmd/completers/completer_utils.py, line 288 [r2] (raw file): Might also want to explain why this is done (to support unsaved files in editors). ycmd/completers/cpp/clang_completer.py, line 354 [r2] (raw file): ycmd/completers/go/go_completer.py, line 204 [r2] (raw file): ycmd/tests/test_utils.py, line 197 [r2] (raw file): ycmd/tests/cs/testdata/testy/Unicode.cs, line 14 [r2] (raw file): Comments from Reviewable |
Review status: all files reviewed at latest revision, 22 unresolved discussions, some commit checks failed. ycmd/completers/completer_utils.py, line 222 [r2] (raw file): Comments from Reviewable |
Review status: all files reviewed at latest revision, 22 unresolved discussions, some commit checks failed. ycmd/completers/completer_utils.py, line 222 [r2] (raw file): Comments from Reviewable |
Review status: 32 of 44 files reviewed at latest revision, 22 unresolved discussions. ycmd/identifier_utils.py, line 54 [r2] (raw file): ycmd/identifier_utils.py, line 56 [r2] (raw file): ycmd/request_wrap.py, line 102 [r2] (raw file):
Whereas
However, the difference seems to be with windows line endings:
Maybe we just need to write a version of ycmd/request_wrap.py, line 127 [r2] (raw file): I think the issue here is not strictly with the To explain: As it happens, multi-byte characters in the query don't work anyway (due to the filtering logic) so that's probably why i missed it. I think I need to change this to:
Which is not only more correct, but simpler and more efficient (I think). As it happens I can't seem to write a test which breaks this (which matches with my other testing), though I still think this is working with characters, so should be using code points not bytes for simplicity. ycmd/request_wrap.py, line 150 [r2] (raw file): ycmd/responses.py, line 137 [r2] (raw file): ycmd/server_state.py, line 115 [r2] (raw file): Happy to remove it if you think it is unlikely to be more generally useful. There is of course a related performance cost. ycmd/utils.py, line 134 [r2] (raw file): ycmd/utils.py, line 135 [r2] (raw file): ycmd/utils.py, line 137 [r2] (raw file): ycmd/utils.py, line 154 [r2] (raw file): ycmd/utils.py, line 165 [r2] (raw file): ycmd/completers/completer.py, line 78 [r2] (raw file): ycmd/completers/completer_utils.py, line 121 [r2] (raw file): ycmd/completers/completer_utils.py, line 175 [r2] (raw file): ycmd/completers/completer_utils.py, line 194 [r2] (raw file): std::string GetUtf8String( const boost::python::object &string_or_unicode ) {
extract< std::string > to_string( string_or_unicode );
if ( to_string.check() )
return to_string();
return extract< std::string >( str( string_or_unicode ).encode( "utf8" ) ); // here
} I can't remember precisely, but I split this method up and it was throwing an exception on the marked line: extract< std::string > to_string( string_or_unicode );
if ( to_string.check() )
return to_string();
auto s = str( string_or_unicode );
s.encode( "utf8" ); // here IIRC, though i tmight have been the previous line.
return extract< std::string >( s ); I have to admit I don't know the python C api or the boost part well-enough to fix it here (in particular how it interacts with python-future), so I just converted int he python layer where it was more familiar, if significantly less efficient. If anyone has any ideas how to better write the ycmd/completers/completer_utils.py, line 222 [r2] (raw file): ycmd/completers/completer_utils.py, line 288 [r2] (raw file): ycmd/completers/cpp/clang_completer.py, line 354 [r2] (raw file): Amy objection to leaving this for later/never and making it a FIXME? ycmd/completers/go/go_completer.py, line 204 [r2] (raw file): Removed the NOTE qualifier as this is really just commentary ycmd/tests/test_utils.py, line 197 [r2] (raw file): ycmd/tests/cs/testdata/testy/Unicode.cs, line 14 [r2] (raw file): Comments from Reviewable |
Review status: 32 of 44 files reviewed at latest revision, 10 unresolved discussions. ycmd/completers/completer_utils.py, line 222 [r2] (raw file): Comments from Reviewable |
Review status: 32 of 44 files reviewed at latest revision, 7 unresolved discussions, some commit checks failed. ycmd/request_wrap.py, line 102 [r2] (raw file): ycmd/request_wrap.py, line 127 [r2] (raw file): I'm fine with the proposed codepoint version if for any reason, then because it's more obviously correct. ycmd/server_state.py, line 115 [r2] (raw file): ycmd/completers/completer_utils.py, line 194 [r2] (raw file): ycmd/completers/cpp/clang_completer.py, line 354 [r2] (raw file): ycmd/tests/test_utils.py, line 197 [r2] (raw file): Comments from Reviewable |
6d8950e
to
8f2d9cf
Compare
Review status: 29 of 44 files reviewed at latest revision, 6 unresolved discussions. ycmd/request_wrap.py, line 127 [r2] (raw file): ycmd/server_state.py, line 115 [r2] (raw file): ycmd/completers/completer_utils.py, line 194 [r2] (raw file): ycmd/completers/cpp/clang_completer.py, line 354 [r2] (raw file): Comments from Reviewable |
Review status: 25 of 44 files reviewed at latest revision, 5 unresolved discussions. ycmd/request_wrap.py, line 102 [r2] (raw file): Comments from Reviewable |
Reviewed 8 of 12 files at r3, 4 of 4 files at r4, 7 of 7 files at r5. Comments from Reviewable |
Review status: all files reviewed at latest revision, 50 unresolved discussions, some commit checks failed. ycmd/utils.py, line 376 [r5] (raw file): (I don't personally care. I accept both spellings.) Comments from Reviewable |
There's an issue with omnifunc completer: ycm-core/YouCompleteMe#2096 (comment) It might be in YCM rather than here, but certainly it blocks merging :) Review status: all files reviewed at latest revision, 50 unresolved discussions, some commit checks failed. Comments from Reviewable |
947ba79
to
552d35e
Compare
552d35e
to
24f0f21
Compare
Review status: 26 of 44 files reviewed at latest revision, 50 unresolved discussions. ycmd/utils.py, line 369 [r5] (raw file): Comments from Reviewable |
Change lots of places to work with chars or bytes correctly Add lots of comments and TODOs. Vain (and broken) attempt to fix tern renames
- Add tests for to/from byte offset - Add tests for RefactorRename javascript - Test unicode with gocode end-to-end. Send the start_column rather than the current column so that ycmd matching is used - Add missing c-sharp test file - Upgrade @expectedfailure to support matching the exception raised. Use it to show that identifier completer is busted
…to FilterAndSortCandidates
c052e2e
to
417a7fe
Compare
Great work!! I've only spotted a typo. Reviewed 18 of 43 files at r1, 4 of 12 files at r3, 2 of 4 files at r4, 1 of 7 files at r5, 14 of 17 files at r7, 1 of 1 files at r8, 2 of 2 files at r9, 2 of 2 files at r10. ycmd/tests/cs/get_completions_test.py, line 57 [r10] (raw file): Comments from Reviewable |
…dd behaviour for empty string and the string containing only a newline Additional tidying up: - Remove ToHex which was for debugging only - Fix windows test failures - always return a proper path - Correct many typographical errors and change Query calculation to work in codepoints not bytes, for consistency and clarity. - Add test and explanation for the deep copy of candidates - Remove debug code - Minor typo corrections - Fix TypeScript unicode tests on Windows TSServer adds a newline at the end of the response message and counts it as one character (\n) towards the content length. But newlines are two characters on Windows (\r\n). To take care of that, the universal_newlines option is set when starting the TSServer subprocess. This option automatically converts Windows newlines to \n. However, it also opens the stdin, stdout, and stderr as text streams instead of binary ones. This does not work properly with unicode characters on Windows and Python 3. Therefore, we directly increment the content length if we are on Windows instead of using the universal_newlines option. - Update unicode tests These tests don't raise an IndexError exception anymore but return an empty list of candidates instead. - Remove unnecessary logging objects/imports
417a7fe
to
7330fa0
Compare
Review status: 43 of 44 files reviewed at latest revision, 1 unresolved discussion. ycmd/tests/cs/get_completions_test.py, line 57 [r10] (raw file): Comments from Reviewable |
Sweet Jesus, is this actually landing? :D Awesomesauce! @homu r+ Review status: 43 of 44 files reviewed at latest revision, 1 unresolved discussion. Comments from Reviewable |
📌 Commit 7330fa0 has been approved by |
⚡ Test exempted - status |
[READY] Fix issues with multi-byte characters ## Summary This change introduces more general support for non-ASCII characters in buffers handled by YCMD. In ycmd's public API, all offsets are byte offsets into the UTF-8 encoded buffers. We also assume (because, we have no other choice) that files stored on disk are also UTF-8 encoded. Internally, almost all of ycmd's functionality operates on unicode strings (python 2 `unicode()` and python 3 `str()` objects, transparently via `future`). Many of the downstream completion engines expect unicode code points as the offsets in their APIs. One special case is the `ycm_core` library (identifier completer and clang completer), which requires instances of the _native_ `str` type. All strings used within the c++ using `boost::python` require passing through `ToCppStringCompatible` Previously, we were largely just assuming that `code point == byte offset` - i.e. all buffers contained only ASCII characters. This worked up to a point, but more by luck than judgement in a number of places. ## References In combination with a YCM change and PR #453, I hope this: - fixes #109 - fixes ycm-core/YouCompleteMe#2096 - fixes ycm-core/YouCompleteMe#2088 - fixes ycm-core/YouCompleteMe#2069 - fixes ycm-core/YouCompleteMe#2066 - fixes ycm-core/YouCompleteMe#1378 ## Overview of changes The changes fall into the following areas: - Providing access to and conversion to/from code points and byte offsets (`request_wrap.py`) - Changing certain algorithms/features to work entirely in codepoint space when they are trying to operate on logical 'characters' within the buffer (see known issues for why this isn't perfect, but probably most of the way there) - Changing the completers to convert between the external (on both sides) and internal representations by using the shortcuts provided in `request_wrap.py` - Adding tests for each of the completers for both completions and subcommands ## Completer-specific notes Pretty much all of the completers I tested required some changes: - clang uses utf-8 and byte offsets, but had some bugs with the `GetDoc` parsing stuff - OmniSharp speaks codepoint offsets - Tern speaks codepoint offsets - JediHTTP speaks codepoint offsets - tsserver speaks codepoint offsets - gocode speaks byte offsets - racer i did not test ## Further work / Known issues - we act blissfully ignorant of the case where a unicode character consumes multiple code points (such as where there is a modifier after the code point) - when typing a unicode character, we still get an exception from `bitset` (see #453 for that fix) - the filtering and sorting system is 100% designed for ASCII only, and it is not in the scope of this PR to change that. Currently after any filtering operation, words containing non-ASCII characters are excluded. - I did not get round to testing rust using racer - there are further changes required to YouCompleteMe client (a further PR is coming for that) <!-- Reviewable:start --> --- This change is [<img src="https://reviewable.io/review_button.svg" height="35" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/valloric/ycmd/455) <!-- Reviewable:end -->
[READY] Fixes for multi-byte errors # PR Prelude Thank you for working on YCM! :) **Please complete these steps and check these boxes (by putting an `x` inside the brackets) _before_ filing your PR:** - [X] I have read and understood YCM's [CONTRIBUTING][cont] document. - [X] I have read and understood YCM's [CODE_OF_CONDUCT][code] document. - [X] I have included tests for the changes in my PR. If not, I have included a rationale for why I haven't. - [X] **I understand my PR may be closed if it becomes obvious I didn't actually perform all of these steps.** # Why this change is necessary and useful There are a number of recent errors with unicode (most of which caused by the server, see PR ycm-core/ycmd#455. In testing I fixed a number of client-side tracebacks also. This is by no means a comprehensive set of fixes for the client - I have simply fixed those that I came across in testing. Summary: - fixes for errors when typing in c-sharp files due to the completion done handler - fixes for FixIts to apply correctly with multi-byte characters - fixes for unicode characters in return from the omni completer [cont]: https://github.com/Valloric/YouCompleteMe/blob/master/CONTRIBUTING.md [code]: https://github.com/Valloric/YouCompleteMe/blob/master/CODE_OF_CONDUCT.md <!-- Reviewable:start --> --- This change is [<img src="https://reviewable.io/review_button.svg" height="35" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/valloric/youcompleteme/2108) <!-- Reviewable:end -->
Summary
This change introduces more general support for non-ASCII characters in buffers handled by YCMD.
In ycmd's public API, all offsets are byte offsets into the UTF-8 encoded buffers. We also assume (because, we have no other choice) that files stored on disk are also UTF-8 encoded. Internally, almost all of ycmd's functionality operates on unicode strings (python 2
unicode()
and python 3str()
objects, transparently viafuture
). Many of the downstream completion engines expect unicode code points as the offsets in their APIs. One special case is theycm_core
library (identifier completer and clang completer), which requires instances of the nativestr
type. All strings used within the c++ usingboost::python
require passing throughToCppStringCompatible
Previously, we were largely just assuming that
code point == byte offset
- i.e. all buffers contained only ASCII characters. This worked up to a point, but more by luck than judgement in a number of places.References
In combination with a YCM change and PR #453, I hope this:
Overview of changes
The changes fall into the following areas:
request_wrap.py
)request_wrap.py
Completer-specific notes
Pretty much all of the completers I tested required some changes:
GetDoc
parsing stuffFurther work / Known issues
bitset
(see [READY] Fix IndexError exception from C++ #453 for that fix)This change is