[READY] Fix issues with multi-byte characters #455

Merged
merged 16 commits into from
Apr 24, 2016

Conversation

puremourning
Member

Summary

This change introduces more general support for non-ASCII characters in buffers handled by YCMD.

In ycmd's public API, all offsets are byte offsets into the UTF-8 encoded buffers. We also assume (because we have no other choice) that files stored on disk are UTF-8 encoded. Internally, almost all of ycmd's functionality operates on unicode strings (Python 2 unicode() and Python 3 str() objects, transparently via future). Many of the downstream completion engines expect unicode code points as the offsets in their APIs. One special case is the ycm_core library (identifier completer and clang completer), which requires instances of the native str type. All strings passed into the C++ layer via boost::python must go through ToCppStringCompatible.

Previously, we were largely just assuming that code point == byte offset - i.e. all buffers contained only ASCII characters. This worked up to a point, but more by luck than judgement in a number of places.
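
A minimal sketch of the distinction, assuming Python 3 and a made-up sample line (not taken from the ycmd test data). The same character sits at a different 1-based column depending on whether the column is counted in code points or in UTF-8 bytes:

```python
# A made-up line containing one two-byte character ('ï').
line = 'naïve.method'
line_bytes = line.encode('utf-8')

# 1-based column of the '.' counted in code points and in UTF-8 bytes.
dot_codepoint_column = line.index('.') + 1
dot_byte_column = line_bytes.index(b'.') + 1

print(dot_codepoint_column, dot_byte_column)
```

For a line containing only ASCII the two columns are equal, which is why the old assumption held for ASCII-only buffers.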

References

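In combination with a YCM change and PR #453, I hope this:

  • fixes #109
  • fixes ycm-core/YouCompleteMe#2096
  • fixes ycm-core/YouCompleteMe#2088
  • fixes ycm-core/YouCompleteMe#2069
  • fixes ycm-core/YouCompleteMe#2066
  • fixes ycm-core/YouCompleteMe#1378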

Overview of changes

The changes fall into the following areas:

  • Providing access to and conversion to/from code points and byte offsets (request_wrap.py)
  • Changing certain algorithms/features to work entirely in codepoint space when they are trying to operate on logical 'characters' within the buffer (see known issues for why this isn't perfect, but probably most of the way there)
  • Changing the completers to convert between the external (on both sides) and internal representations by using the shortcuts provided in request_wrap.py
  • Adding tests for each of the completers for both completions and subcommands
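
The "shortcuts" idea can be sketched roughly as follows. This is a hypothetical stand-in, not the actual class in request_wrap.py: the class name, the constructor arguments and the lazy caching are assumptions for illustration, and only the key names 'line_value' and 'column_codepoint' appear elsewhere in this conversation.

```python
class RequestSketch:
    """Hypothetical stand-in for a request wrapper; not ycmd's actual class."""

    def __init__(self, line_value, column_num):
        # line_value: the current line as a unicode string.
        # column_num: 1-based byte offset into the UTF-8 encoded line.
        self._raw = {'line_value': line_value, 'column_num': column_num}
        # Derived values are computed on first access and then cached.
        self._computed = {
            'line_bytes': self._line_bytes,
            'column_codepoint': self._column_codepoint,
        }
        self._cache = {}

    def __getitem__(self, key):
        if key in self._raw:
            return self._raw[key]
        if key not in self._cache:
            self._cache[key] = self._computed[key]()
        return self._cache[key]

    def _line_bytes(self):
        return self['line_value'].encode('utf-8')

    def _column_codepoint(self):
        # Decode everything before the cursor and count its code points.
        prefix = self['line_bytes'][:self['column_num'] - 1]
        return len(prefix.decode('utf-8')) + 1


request = RequestSketch('naïve.method', 7)
print(request['column_codepoint'])
```

A completer that speaks code points could then read request['column_codepoint'], while one that speaks bytes keeps using request['column_num'].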

Completer-specific notes

Pretty much all of the completers I tested required some changes:

  • clang uses utf-8 and byte offsets, but had some bugs with the GetDoc parsing stuff
  • OmniSharp speaks codepoint offsets
  • Tern speaks codepoint offsets
  • JediHTTP speaks codepoint offsets
  • tsserver speaks codepoint offsets
  • gocode speaks byte offsets
  • racer: I did not test

Further work / Known issues

  • we act blissfully ignorant of the case where a unicode character consumes multiple code points (such as where there is a modifier after the code point)
  • when typing a unicode character, we still get an exception from bitset (see [READY] Fix IndexError exception from C++ #453 for that fix)
  • the filtering and sorting system is 100% designed for ASCII only, and it is not in the scope of this PR to change that. Currently after any filtering operation, words containing non-ASCII characters are excluded.
  • I did not get round to testing rust using racer
  • there are further changes required to YouCompleteMe client (a further PR is coming for that)
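
To illustrate the first known issue, here is a small Python 3 sketch using only the standard library. The example character is chosen for illustration; it shows that one user-visible character can span two code points, so even codepoint offsets do not always equal "characters":

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT renders as a single 'é'.
decomposed = 'e\u0301'
# The precomposed equivalent, U+00E9.
composed = unicodedata.normalize('NFC', decomposed)

# Code point count and UTF-8 byte count for each form.
print(len(decomposed), len(decomposed.encode('utf-8')))
print(len(composed), len(composed.encode('utf-8')))
```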

This change is Reviewable

@puremourning
Member Author

Will happily rebase this and squash the history before merging, but I thought all the history might help reviewers.

This is marked as RFC mainly because it is quite an invasive change and quite a big one. I'm pretty confident that it resolves a lot of issues, but there is always the risk of a regression.

@coveralls

Coverage Status

Coverage increased (+0.2%) to 84.704% when pulling bc5e153 on puremourning:unicode-investigation into 09f2164 on Valloric:master.

@coveralls

Coverage Status

Coverage increased (+0.2%) to 84.704% when pulling f1d1144 on puremourning:unicode-investigation into 09f2164 on Valloric:master.

@Valloric
Member

Sweet Jesus, this is fantastic! :D Must have been a horrendous amount of work. Thanks so much!

Haven't yet started reviewing this; I'll try to find the time tomorrow.

@Valloric
Member

I'm not even going to attempt to estimate the depths of suffering you must have reached while fixing all of this. :lgtm: with minor comments.

WRT keeping the commits, I'm not at all against PRs having more than one commit, merely against having nonsensical commits in master. So if you carefully pruned your PR history with git rebase -i so that all commits are relevant and are something we'd like to keep going forward, then great, let's have them!


Reviewed 42 of 43 files at r1, 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 23 unresolved discussions, some commit checks failed.


ycmd/identifier_utils.py, line 54 [r2] (raw file):
"at the" -> "at"


ycmd/identifier_utils.py, line 56 [r2] (raw file):
"a 'alpha'" -> "an 'alpha'"


ycmd/request_wrap.py, line 102 [r2] (raw file):
Wait, this works? Why the hell does splitlines() even exist then if doing split('\n') is less buggy?


ycmd/request_wrap.py, line 127 [r2] (raw file):
I see a "bytes number - 1" expression, which makes me instantly doubt it. Are we sure this is correct?


ycmd/request_wrap.py, line 150 [r2] (raw file):
Should end with period.


ycmd/responses.py, line 64 [r2] (raw file):
If we only had something that could check these kinds of invariants for us... some sort of "static type checking" machinery...

God damn you Python. God damn you.


ycmd/responses.py, line 137 [r2] (raw file):
Is this a before-submit TODO or an after-submit one?


ycmd/server_state.py, line 115 [r2] (raw file):
Why are we doing this again? Not saying I'm against it, just asking.


ycmd/utils.py, line 134 [r2] (raw file):
Comma not needed.


ycmd/utils.py, line 135 [r2] (raw file):
Second comma not needed.


ycmd/utils.py, line 137 [r2] (raw file):
", etc. ," -> "and similar"


ycmd/utils.py, line 154 [r2] (raw file):
Same changes to doc as above.


ycmd/utils.py, line 165 [r2] (raw file):
Maybe ToBytes instead of encode?


ycmd/completers/completer.py, line 43 [r2] (raw file):
Thanks for this!


ycmd/completers/completer.py, line 78 [r2] (raw file):
You might want to include a link to the UTF-8 Everywhere manifesto somewhere in this section.


ycmd/completers/completer_utils.py, line 121 [r2] (raw file):
Misplaced #?


ycmd/completers/completer_utils.py, line 175 [r2] (raw file):
Might want to disambiguate "future" into "python-future". Also, something bad happened to the sentence after it: "so if we pass in a pass it"? :)


ycmd/completers/completer_utils.py, line 194 [r2] (raw file):
I remember writing code in the C++ layer for FilterAndSortCandidates so that it can accept unicode objects too. The idea was to make this conversion in the Python layer unnecessary, though it seems I did a mess of things.

This works too.


ycmd/completers/completer_utils.py, line 288 [r2] (raw file):
This might need a bit more clarification. What this function is actually doing is reading the file data from the request, and if it's not there, it reads it from disk.

Might also want to explain why this is done (to support unsaved files in editors).


ycmd/completers/cpp/clang_completer.py, line 354 [r2] (raw file):
Now or later for TODO?


ycmd/completers/go/go_completer.py, line 204 [r2] (raw file):
A "Ben" note? :)


ycmd/tests/test_utils.py, line 197 [r2] (raw file):
Damn straight, no bugs should suddenly disappear without our explicit approval! :D


ycmd/tests/cs/testdata/testy/Unicode.cs, line 14 [r2] (raw file):
Tabs? :(


Comments from Reviewable

@micbou
Collaborator

micbou commented Apr 10, 2016

Review status: all files reviewed at latest revision, 22 unresolved discussions, some commit checks failed.


ycmd/completers/completer_utils.py, line 222 [r2] (raw file):
We need to make a deep copy of the candidates list. Otherwise, we are modifying the cached completions when filtering the candidates and all unfiltered candidates become byte objects:
ycmd-pr-455.gif
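
The aliasing problem described here can be reproduced in isolation. The sketch below is illustrative only: the cache layout, the 'insertion_text' key and convert_in_place are assumptions, not the code in completer_utils.py. It shows why a deep copy, rather than a shallow one, protects the cached candidates:

```python
import copy

# Hypothetical cache of completion candidates (unicode strings in dicts).
cached_candidates = [{'insertion_text': 'føø'}, {'insertion_text': 'bar'}]


def convert_in_place(candidates):
    """Stand-in for a step that rewrites candidates as UTF-8 bytes."""
    for candidate in candidates:
        candidate['insertion_text'] = candidate['insertion_text'].encode('utf-8')
    return candidates


# A shallow copy (list(...)) would still share the inner dicts with the cache,
# so the conversion has to work on a deep copy.
converted = convert_in_place(copy.deepcopy(cached_candidates))

print(type(cached_candidates[0]['insertion_text']).__name__)
print(type(converted[0]['insertion_text']).__name__)
```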


Comments from Reviewable

@micbou
Collaborator

micbou commented Apr 10, 2016

Review status: all files reviewed at latest revision, 22 unresolved discussions, some commit checks failed.


ycmd/completers/completer_utils.py, line 222 [r2] (raw file):
A more efficient solution is to call FilterAndSortCandidates even if request[ 'query' ] is the empty string.


Comments from Reviewable

@puremourning
Member Author

Review status: 32 of 44 files reviewed at latest revision, 22 unresolved discussions.


ycmd/identifier_utils.py, line 54 [r2] (raw file):
Done.


ycmd/identifier_utils.py, line 56 [r2] (raw file):
Done.


ycmd/request_wrap.py, line 102 [r2] (raw file):
From what I can tell splitlines works very strangely:

>>> ''.splitlines()
[]
>>> ' '.splitlines()
[' ']
>>> '\n'.splitlines()
['']
>>> ' \n '.splitlines()
[' ', ' ']

Whereas .split( '\n' ) behaves not only more consistently, but more like what we want here (i.e. an array containing the "lines" in the buffer):

>>> ''.split( '\n' )
['']
>>> ' '.split( '\n' )
[' ']
>>> '\n'.split( '\n' )
['', '']
>>> ' \n '.split( '\n' )
[' ', ' ']

However, the difference seems to be with windows line endings:

>>> '\r\n'.split( '\n' )
['\r', '']
>>> '\r\n'.splitlines()
['']

Maybe we just need to write a version of splitlines() which behaves more consistently?
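
One possible shape for such a function, as a sketch only. The name split_lines_sketch is made up, and this is not necessarily what ended up in ycmd. It keeps str.splitlines for the handling of Windows line endings and patches the two surprising cases shown above:

```python
def split_lines_sketch(contents):
    """Split like str.splitlines, but always return every indexable line."""
    # ''.splitlines() returns [], but an empty buffer still has one empty line.
    if contents == '':
        return ['']

    lines = contents.splitlines()

    # A trailing line ending starts a new, empty line, which str.splitlines
    # drops.
    if contents.endswith(('\r', '\n')):
        lines.append('')

    return lines


for sample in ['', ' ', '\n', ' \n ', '\r\n']:
    print(repr(sample), split_lines_sketch(sample))
```

Note that str.splitlines also treats a few other characters (form feed, for example) as line boundaries, and this sketch inherits that behaviour.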


ycmd/request_wrap.py, line 127 [r2] (raw file):
I think you may be on to something. Thanks!

I think the issue here is not strictly with the - 1, but with the fact that taking a range of bytes like this is not safe: column_num might be the (1-based) index of a multi-byte char, and so cutting the bytes like this might truncate the last "character" in the query.

To explain: start_column and column_num are 1-based byte offsets pointing at the first byte of their respective "n-byte" character. All the - 1 here does is ensure that they are correct indexes into the 0-based string line_bytes. This code slices up the bytes to get the range of chars after the start_column up to the column_num (i.e. the query after the .) so that we can filter it.

As it happens, multi-byte characters in the query don't work anyway (due to the filtering logic) so that's probably why I missed it.

I think I need to change this to:

query = self[ 'line_value' ][ self[ 'start_codepoint' ] - 1 : self[ 'column_codepoint' ] - 1 ]

Which is not only more correct, but simpler and more efficient (I think).


As it happens I can't seem to write a test which breaks this (which matches with my other testing), though I still think this is working with characters, so should be using code points not bytes for simplicity.
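
A standalone sketch of why no test breaks, using a made-up line and offsets that sit on character boundaries. The variable names mirror the ones discussed here but this is not the request_wrap.py code. The byte slice and the codepoint slice yield the same query:

```python
line_value = 'é.añb'
line_bytes = line_value.encode('utf-8')

# 1-based byte offsets: the query starts at 'a' and the cursor sits on 'b'.
start_column = 4
column_num = 7

# 1-based code point offsets for the same two positions.
start_codepoint = len(line_bytes[:start_column - 1].decode('utf-8')) + 1
column_codepoint = len(line_bytes[:column_num - 1].decode('utf-8')) + 1

# Slice once in byte space and once in code point space.
query_from_bytes = line_bytes[start_column - 1:column_num - 1].decode('utf-8')
query_from_codepoints = line_value[start_codepoint - 1:column_codepoint - 1]

print(query_from_bytes, query_from_codepoints)
```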


ycmd/request_wrap.py, line 150 [r2] (raw file):
Done.


ycmd/responses.py, line 137 [r2] (raw file):
It is a legacy comment. I totally checked them all (he says, frantically checking them all).


ycmd/server_state.py, line 115 [r2] (raw file):
There have been a number of times in testing where the log indicated that semantic completion wasn't being applied (due to pebkac), but it wasn't obvious why. I made this change to debug it, and never removed it. I can't actually remember the specifics, but it isn't the first time I've made a similar change for debugging.

Happy to remove it if you think it is unlikely to be more generally useful. There is of course a related performance cost.


ycmd/utils.py, line 134 [r2] (raw file):
Done.


ycmd/utils.py, line 135 [r2] (raw file):
Done.


ycmd/utils.py, line 137 [r2] (raw file):
Done.


ycmd/utils.py, line 154 [r2] (raw file):
Done.


ycmd/utils.py, line 165 [r2] (raw file):
Done.


ycmd/completers/completer.py, line 78 [r2] (raw file):
Done.


ycmd/completers/completer_utils.py, line 121 [r2] (raw file):
Done.


ycmd/completers/completer_utils.py, line 175 [r2] (raw file):
Done.


ycmd/completers/completer_utils.py, line 194 [r2] (raw file):
Well, this was the thing that I spent 3 days debugging. I found that converting a unicode object to std::string didn't work. Specifically, in PythonSupport.cpp:YouCompleteMe::GetUtf8String, it was throwing an exception on the marked line:

std::string GetUtf8String( const boost::python::object &string_or_unicode ) {
  extract< std::string > to_string( string_or_unicode );

  if ( to_string.check() )
    return to_string();

  return extract< std::string >( str( string_or_unicode ).encode( "utf8" ) ); // here
}

I can't remember precisely, but I split this method up and it was throwing an exception on the marked line:

  extract< std::string > to_string( string_or_unicode );

  if ( to_string.check() )
    return to_string();

  auto s = str( string_or_unicode );
  s.encode( "utf8" ); // here IIRC, though it might have been the previous line.

  return extract< std::string >( s );

I have to admit I don't know the Python C API or the boost part well enough to fix it here (in particular how it interacts with python-future), so I just converted in the Python layer, where it was more familiar, if significantly less efficient.

If anyone has any ideas how to better write the GetUtf8String function, then I can have a go at it :)
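
For reference, the Python-layer workaround amounts to something like the sketch below. It assumes plain Python 3 (no python-future), and to_utf8_bytes_sketch is a made-up name rather than ycmd's helper: every candidate is normalised to UTF-8 bytes before it is handed to the native code.

```python
def to_utf8_bytes_sketch(value):
    """Return value as UTF-8 encoded bytes, whatever type it started as."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode('utf-8')
    # Fall back to the object's string representation.
    return str(value).encode('utf-8')


candidates = ['føø', b'bar', 42]
print([to_utf8_bytes_sketch(candidate) for candidate in candidates])
```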


ycmd/completers/completer_utils.py, line 222 [r2] (raw file):
Great catch, thanks!


ycmd/completers/completer_utils.py, line 288 [r2] (raw file):
Done.


ycmd/completers/cpp/clang_completer.py, line 354 [r2] (raw file):
Oh crap I forgot about that. It is probably quite tricky to change and niche. It is probably OK too: nearest character vs nearest byte probably actually makes little difference.

Any objection to leaving this for later/never and making it a FIXME?


ycmd/completers/go/go_completer.py, line 204 [r2] (raw file):
Doh. This was probably originally a TODO that didn't get the memo about not needing a name when it got promoted to a NOTE.

Removed the NOTE qualifier as this is really just commentary


ycmd/tests/test_utils.py, line 197 [r2] (raw file):
"Passing unexpectedly". It's rare, but it means that the tests are tidied up when we do fix bugs (rather than leaving confusing "why is this expected to fail, but isn't?" scenarios).


ycmd/tests/cs/testdata/testy/Unicode.cs, line 14 [r2] (raw file):
Humph. This is copy pasta from the other .cs test files. I have fixed this one, but not the others, as that may change offsets in the files :)


Comments from Reviewable

@coveralls

Coverage Status

Coverage increased (+0.2%) to 84.708% when pulling 6d8950e on puremourning:unicode-investigation into 09f2164 on Valloric:master.

@puremourning
Member Author

Review status: 32 of 44 files reviewed at latest revision, 10 unresolved discussions.


ycmd/completers/completer_utils.py, line 222 [r2] (raw file):
You know what, I can't actually repro this. Am I missing something obvious?


Comments from Reviewable

@Valloric
Member

Review status: 32 of 44 files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


ycmd/request_wrap.py, line 102 [r2] (raw file):
Yes, it seems we might need to write our own splitlines. Shouldn't be hard.


ycmd/request_wrap.py, line 127 [r2] (raw file):
Ah, I think this might be correct after all. It's parsing a UTF-8 slice as unicode, but this slice should always be correct if the client is sending correct data. In other words, if the client is sending offsets into the middle of a multi-byte character, it has a bug.

I'm fine with the proposed codepoint version, if for no other reason than that it's more obviously correct.
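
To show what "offsets into the middle of a multi-byte character" means in practice, here is a small Python 3 sketch with a made-up line. A byte slice that ends inside a character cannot be decoded at all:

```python
line_bytes = 'añb'.encode('utf-8')  # b'a\xc3\xb1b'

# Byte offset 3 (1-based) points at the second byte of 'ñ', which is not a
# character boundary.
bad_column = 3

try:
    line_bytes[:bad_column - 1].decode('utf-8')
    outcome = 'decoded'
except UnicodeDecodeError:
    outcome = 'UnicodeDecodeError'

print(outcome)
```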


ycmd/server_state.py, line 115 [r2] (raw file):
I'm a bit concerned that we're exposing this in the API, so clients may depend on it. Maybe only expose it if logging level is debug?


ycmd/completers/completer_utils.py, line 194 [r2] (raw file):
I think this is fine for now, but maybe leave a TODO that we should possibly do better in the C++ layer.


ycmd/completers/cpp/clang_completer.py, line 354 [r2] (raw file):
No problem with leaving a TODO.


ycmd/tests/test_utils.py, line 197 [r2] (raw file):
Oh I agree with you, it should totally fail if it starts passing without us knowing. I just thought it was funny. :)
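
The behaviour being discussed can be sketched with the standard library alone. The decorator below is hypothetical (the name expected_failure_sketch and its parameters are assumptions; the real helper in test_utils.py is built on nose): a test that fails in the expected way is reported as skipped, and a test that starts passing is reported as a failure.

```python
import functools
import re
import unittest


def expected_failure_sketch(reason, exception_type, message_pattern):
    """Mark a test as a known failure matching a given exception."""
    def decorator(test):
        @functools.wraps(test)
        def wrapper(*args, **kwargs):
            try:
                test(*args, **kwargs)
            except exception_type as error:
                # A different failure than the known one is a real failure.
                if not re.search(message_pattern, str(error)):
                    raise
                raise unittest.SkipTest(reason)
            # Reaching this point means the known bug no longer reproduces.
            raise AssertionError(
                'Test was expected to fail: {0}'.format(reason))
        return wrapper
    return decorator
```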


Comments from Reviewable

@puremourning
Member Author

Review status: 29 of 44 files reviewed at latest revision, 6 unresolved discussions.


ycmd/request_wrap.py, line 127 [r2] (raw file):
Of course. I initially wrote a response explaining why this was OK, then I misinterpreted the slice syntax as including the offset after the : (it doesn't, because the slice is open at the high end). So the old code was working, if not obviously so.


ycmd/server_state.py, line 115 [r2] (raw file):
I think the complexity involved in that outweighs the benefit, so I've just removed it. It isn't a big deal and doesn't add a huge amount of value.


ycmd/completers/completer_utils.py, line 194 [r2] (raw file):
Done.


ycmd/completers/cpp/clang_completer.py, line 354 [r2] (raw file):
Done.


Comments from Reviewable

@coveralls

Coverage Status

Coverage increased (+0.2%) to 84.708% when pulling de103b0 on puremourning:unicode-investigation into 09f2164 on Valloric:master.

@puremourning
Member Author

Review status: 25 of 44 files reviewed at latest revision, 5 unresolved discussions.


ycmd/request_wrap.py, line 102 [r2] (raw file):
Done.


Comments from Reviewable

@coveralls

Coverage Status

Coverage increased (+0.2%) to 84.738% when pulling 4d401e9 on puremourning:unicode-investigation into 09f2164 on Valloric:master.

@Valloric
Member

Reviewed 8 of 12 files at r3, 4 of 4 files at r4, 7 of 7 files at r5.
Review status: all files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

@micbou
Collaborator

micbou commented Apr 11, 2016

Reviewed 25 of 43 files at r1, 8 of 12 files at r3, 4 of 4 files at r4, 7 of 7 files at r5.
Review status: all files reviewed at latest revision, 50 unresolved discussions, some commit checks failed.


ycmd/identifier_utils.py, line 2 [r5] (raw file):
Should be above the copyright.


ycmd/request_wrap.py, line 2 [r5] (raw file):
Should be above the copyright.


ycmd/request_wrap.py, line 131 [r5] (raw file):
Maybe using pprint.pformat here would be better than using slashes.


ycmd/request_wrap.py, line 150 [r5] (raw file):
therefore typo.


ycmd/request_wrap.py, line 151 [r5] (raw file):
Two is.


ycmd/responses.py, line 74 [r5] (raw file):
Missing period.


ycmd/utils.py, line 2 [r5] (raw file):
Should be above the copyright.


ycmd/utils.py, line 141 [r5] (raw file):
Missing period.


ycmd/utils.py, line 147 [r5] (raw file):
This function should be the exact opposite of CodepointOffsetToByteOffset, that is:

def ByteOffsetToCodepointOffset( byte_line_value, byte_offset ):
  """..."""

  # Should be a no-op, but in case someone passes a unicode instance.
  byte_line_value = ToBytes( byte_line_value )

  return len( ToUnicode( byte_line_value[ : byte_offset - 1 ] ) ) + 1
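
A self-contained Python 3 sketch of the "exact opposite" property. The snake_case functions are stand-ins written for this example, not the ycmd implementations, and the sample line is made up. Converting a 1-based codepoint offset to a byte offset and back should return the original offset, including the position one past the end of the line:

```python
def codepoint_offset_to_byte_offset(unicode_line, codepoint_offset):
    """1-based code point offset -> 1-based UTF-8 byte offset."""
    return len(unicode_line[:codepoint_offset - 1].encode('utf-8')) + 1


def byte_offset_to_codepoint_offset(byte_line, byte_offset):
    """1-based UTF-8 byte offset -> 1-based code point offset."""
    return len(byte_line[:byte_offset - 1].decode('utf-8')) + 1


line = 'tést ünicode'
for codepoint_offset in range(1, len(line) + 2):
    byte_offset = codepoint_offset_to_byte_offset(line, codepoint_offset)
    # The round trip must land back on the offset we started from.
    assert byte_offset_to_codepoint_offset(
        line.encode('utf-8'), byte_offset) == codepoint_offset
```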

 


ycmd/utils.py, line 159 [r5] (raw file):
Missing period.


ycmd/utils.py, line 161 [r5] (raw file):
Missing period.


ycmd/utils.py, line 369 [r5] (raw file):
behavior? equivalent, not equivelent.


ycmd/utils.py, line 376 [r5] (raw file):
behaviors?


ycmd/completers/completer.py, line 204 [r5] (raw file):
Missing period.


ycmd/completers/completer.py, line 375 [r5] (raw file):
Missing period.


ycmd/completers/completer.py, line 390 [r5] (raw file):
Missing period at the end of this comment and the other two below.


ycmd/completers/completer_utils.py, line 103 [r5] (raw file):
Missing period.


ycmd/completers/completer_utils.py, line 122 [r5] (raw file):
codepoint offsets instead of codepiont offsets and missing period too.


ycmd/completers/completer_utils.py, line 185 [r5] (raw file):
hass typo.


ycmd/completers/completer_utils.py, line 213 [r5] (raw file):
Missing period.


ycmd/completers/completer_utils.py, line 218 [r5] (raw file):
Missing period.


ycmd/completers/completer_utils.py, line 223 [r5] (raw file):
No space after """ (like above).


ycmd/completers/completer_utils.py, line 232 [r5] (raw file):
Missing period.


ycmd/completers/cpp/clang_completer.py, line 465 [r5] (raw file):
Should we use our SplitLines function here?


ycmd/completers/cs/cs_completer.py, line 512 [r5] (raw file):
Missing is.


ycmd/completers/cs/cs_completer.py, line 513 [r5] (raw file):
Missing period.


ycmd/completers/general/filename_completer.py, line 116 [r5] (raw file):
Missing space after :.


ycmd/completers/javascript/tern_completer.py, line 342 [r5] (raw file):
Missing period.


ycmd/tests/filename_completer_test.py, line 104 [r5] (raw file):
Spaces around =.


ycmd/tests/filename_completer_test.py, line 110 [r5] (raw file):
Spaces around =.


ycmd/tests/get_completions_test.py, line 1 [r5] (raw file):
Missing blank line.


ycmd/tests/get_completions_test.py, line 418 [r5] (raw file):
An underscore between each word is a little too much: GetCompletions_FilterThenReturnFromCache_test would be better.


ycmd/tests/request_wrap_test.py, line 106 [r5] (raw file):
StartColumn_UnicodeNotIdentifier_test?


ycmd/tests/request_wrap_test.py, line 222 [r5] (raw file):
Query_UnicodeSingleCharInclusive_test?


ycmd/tests/request_wrap_test.py, line 228 [r5] (raw file):
Query_UnicodeSingleCharExclusive?


ycmd/tests/test_utils.py, line 221 [r5] (raw file):
It would be simpler to assert the exception with its message:

def Wrapper( *args, **kwargs ):
  assert_that(
    calling( test ).with_args( *args, **kwargs ),
    raises( exception, message ) )
  raise nose.SkipTest( reason )

ycmd/tests/utils_test.py, line 2 [r5] (raw file):
Should be above the copyright.


ycmd/tests/utils_test.py, line 382 [r5] (raw file):
This comment and below ones: should start with a capital letter and missing period.


ycmd/tests/utils_test.py, line 452 [r5] (raw file):
All seems superfluous.


ycmd/tests/clang/get_completions_test.py, line 2 [r5] (raw file):
Should be above the copyright.


ycmd/tests/clang/get_completions_test.py, line 53 [r5] (raw file):
Missing period.


ycmd/tests/clang/get_completions_test.py, line 511 [r5] (raw file):
GetCompletions_UnicodeInline_test?


ycmd/tests/clang/get_completions_test.py, line 540 [r5] (raw file):
No matcher?


ycmd/tests/clang/get_completions_test.py, line 542 [r5] (raw file):
GetCompletions_UnicodeInlineFilter_test?


ycmd/tests/clang/subcommands_test.py, line 2 [r5] (raw file):
Should be above the copyright.


ycmd/tests/clang/subcommands_test.py, line 1085 [r5] (raw file):
Capital letter for unicode?


ycmd/tests/go/get_completions_test.py, line 2 [r5] (raw file):
Should be above copyright.


ycmd/tests/typescript/subcommands_test.py, line 2 [r5] (raw file):
Should be above the copyright.


ycmd/tests/typescript/subcommands_test.py, line 443 [r5] (raw file):
Spaces around =.


ycmd/tests/typescript/subcommands_test.py, line 490 [r5] (raw file):
Spaces around =.


Comments from Reviewable

@Valloric
Member

Review status: all files reviewed at latest revision, 50 unresolved discussions, some commit checks failed.


ycmd/utils.py, line 376 [r5] (raw file):
Careful here, you're telling a Brit to use American English. :D I've heard that can produce murder sprees.

(I don't personally care. I accept both spellings.)


Comments from Reviewable

@puremourning
Member Author

There's an issue with omnifunc completer: ycm-core/YouCompleteMe#2096 (comment)

It might be in YCM rather than here, but certainly it blocks merging :)


Review status: all files reviewed at latest revision, 50 unresolved discussions, some commit checks failed.


Comments from Reviewable

@puremourning
Member Author

Review status: 27 of 44 files reviewed at latest revision, 50 unresolved discussions.


ycmd/identifier_utils.py, line 2 [r5] (raw file):
The spec says anywhere in the top 3 lines. But you obviously feel strongly about this (to me it is meh), so I've changed them all.


ycmd/request_wrap.py, line 2 [r5] (raw file):
Done.


ycmd/request_wrap.py, line 131 [r5] (raw file):
TBH this debugging has served its purpose, so I have removed it (this method is called often at performance critical times, so the extra cycles producing debug nobody is going to read is probably not worth it)


ycmd/request_wrap.py, line 150 [r5] (raw file):
Done.


ycmd/request_wrap.py, line 151 [r5] (raw file):
Done.


ycmd/responses.py, line 74 [r5] (raw file):
Done.


ycmd/utils.py, line 2 [r5] (raw file):
Done.


ycmd/utils.py, line 141 [r5] (raw file):
Done.


ycmd/utils.py, line 147 [r5] (raw file):
Done.


ycmd/utils.py, line 159 [r5] (raw file):
Done.


ycmd/utils.py, line 161 [r5] (raw file):
Done.


ycmd/utils.py, line 376 [r5] (raw file):
Heh. We've previously agreed to stick to colonial spellings only, so I changed it :). I need to set up my vimrc to detect that I'm working in YCM and set spelllang to en_us. Oh, and sell my soul while I'm at it :)


ycmd/completers/completer.py, line 204 [r5] (raw file):
Done.


ycmd/completers/completer.py, line 375 [r5] (raw file):
Done.


ycmd/completers/completer.py, line 390 [r5] (raw file):
Done.


ycmd/completers/completer_utils.py, line 103 [r5] (raw file):
Done.


ycmd/completers/completer_utils.py, line 122 [r5] (raw file):
Done.


ycmd/completers/completer_utils.py, line 185 [r5] (raw file):
Done.


ycmd/completers/completer_utils.py, line 213 [r5] (raw file):
Done.


ycmd/completers/completer_utils.py, line 218 [r5] (raw file):
Done.


ycmd/completers/completer_utils.py, line 223 [r5] (raw file):
Done.


ycmd/completers/completer_utils.py, line 232 [r5] (raw file):
Done.


ycmd/completers/cpp/clang_completer.py, line 465 [r5] (raw file):
In this instance it isn't necessary because our new utils.SplitLines is only required when we need to index into the output using the API's line_number. In this instance, we don't really care about the strangeness of the splitlines() call. Unless you think I missed something?
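
A small sketch of the case where it does matter, assuming (for illustration) a buffer that ends with a newline and an editor that reports the cursor on the empty last line as line 2. Indexing the result of str.splitlines by line number fails there, whereas a split that keeps the trailing empty line works:

```python
contents = 'first line\n'

# Assumed: the cursor is on the empty last line, reported as line 2.
line_num = 2

lines_from_splitlines = contents.splitlines()
lines_from_split = contents.split('\n')

print(len(lines_from_splitlines), len(lines_from_split))

try:
    lines_from_splitlines[line_num - 1]
    lookup = 'ok'
except IndexError:
    lookup = 'IndexError'

print(lookup)
```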


ycmd/completers/cs/cs_completer.py, line 512 [r5] (raw file):
Done.


ycmd/completers/cs/cs_completer.py, line 513 [r5] (raw file):
Done.


ycmd/completers/general/filename_completer.py, line 116 [r5] (raw file):
Done.


ycmd/completers/javascript/tern_completer.py, line 342 [r5] (raw file):
Done.


ycmd/tests/filename_completer_test.py, line 104 [r5] (raw file):
Done.


ycmd/tests/filename_completer_test.py, line 110 [r5] (raw file):
Done.


ycmd/tests/get_completions_test.py, line 1 [r5] (raw file):
Done.


ycmd/tests/get_completions_test.py, line 418 [r5] (raw file):
Done.


ycmd/tests/request_wrap_test.py, line 106 [r5] (raw file):
Done.


ycmd/tests/request_wrap_test.py, line 222 [r5] (raw file):
Done.


ycmd/tests/request_wrap_test.py, line 228 [r5] (raw file):
Done.


ycmd/tests/test_utils.py, line 221 [r5] (raw file):
Hmmm. That is better probably, but how strongly do you feel about this one? Reason I ask is that changing the API and retesting all the cases will be a pita and I feel like there are more pressing areas of concern :/


ycmd/tests/utils_test.py, line 2 [r5] (raw file):
Done.


ycmd/tests/utils_test.py, line 382 [r5] (raw file):
I think I am going to give up on writing comments ^_^.


ycmd/tests/utils_test.py, line 452 [r5] (raw file):
Done.


ycmd/tests/clang/get_completions_test.py, line 2 [r5] (raw file):
Done.


ycmd/tests/clang/get_completions_test.py, line 53 [r5] (raw file):
Done.


ycmd/tests/clang/get_completions_test.py, line 511 [r5] (raw file):
All of the tests I added had Unicode in them, which I thought made it clearer :(


ycmd/tests/clang/get_completions_test.py, line 540 [r5] (raw file):
Done.


ycmd/tests/clang/get_completions_test.py, line 542 [r5] (raw file):
Done.


ycmd/tests/clang/subcommands_test.py, line 2 [r5] (raw file):
Done.


ycmd/tests/clang/subcommands_test.py, line 1085 [r5] (raw file):
Done.


ycmd/tests/go/get_completions_test.py, line 2 [r5] (raw file):
Done.


ycmd/tests/typescript/subcommands_test.py, line 2 [r5] (raw file):
Done.


ycmd/tests/typescript/subcommands_test.py, line 443 [r5] (raw file):
Done.


ycmd/tests/typescript/subcommands_test.py, line 490 [r5] (raw file):
Done.


Comments from Reviewable

@puremourning
Member Author

Review status: 26 of 44 files reviewed at latest revision, 50 unresolved discussions.


ycmd/utils.py, line 369 [r5] (raw file):
Done.


Comments from Reviewable

@coveralls

Coverage Status

Coverage increased (+0.2%) to 84.76% when pulling 24f0f21 on puremourning:unicode-investigation into 0e230f8 on Valloric:master.

Change lots of places to work with chars or bytes correctly

Add lots of comments and TODOs. Vain (and broken) attempt to fix tern
renames
- Add tests for to/from byte offset
- Add tests for RefactorRename javascript
- Test unicode with gocode end-to-end. Send the start_column rather than the current column so that ycmd matching is used
- Add missing c-sharp test file
- Upgrade @expectedfailure to support matching the exception raised. Use it to show that identifier completer is busted
@coveralls

Coverage Status

Coverage increased (+0.2%) to 84.723% when pulling 417a7fe on puremourning:unicode-investigation into 30a4524 on Valloric:master.

@vheon
Contributor

vheon commented Apr 24, 2016

:lgtm: Great work!! I've only spotted a typo.


Reviewed 18 of 43 files at r1, 4 of 12 files at r3, 2 of 4 files at r4, 1 of 7 files at r5, 14 of 17 files at r7, 1 of 1 files at r8, 2 of 2 files at r9, 2 of 2 files at r10.
Review status: all files reviewed at latest revision, 1 unresolved discussion.


ycmd/tests/cs/get_completions_test.py, line 57 [r10] (raw file):
I believe this is GetCompletions_...


Comments from Reviewable

…dd behaviour for empty string and the string containing only a newline

Additional tidying up:
- Remove ToHex which was for debugging only
- Fix windows test failures - always return a proper path
- Correct many typographical errors and change Query calculation to work in codepoints not bytes, for consistency and clarity.
- Add test and explanation for the deep copy of candidates
- Remove debug code
- Minor typo corrections
- Fix TypeScript unicode tests on Windows
	TSServer adds a newline at the end of the response message and counts
	it as one character (\n) towards the content length. But newlines are
	two characters on Windows (\r\n). To take care of that, the
	universal_newlines option is set when starting the TSServer
	subprocess. This option automatically converts Windows newlines to \n.
	However, it also opens the stdin, stdout, and stderr as text streams
	instead of binary ones. This does not work properly with unicode
	characters on Windows and Python 3. Therefore, we directly increment
	the content length if we are on Windows instead of using the
	universal_newlines option.
- Update unicode tests
	These tests don't raise an IndexError exception anymore but return
	an empty list of candidates instead.
- Remove unnecessary logging objects/imports
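
The content-length adjustment described in the TypeScript item can be sketched as follows. This is an illustration under stated assumptions, not the ycmd code: read_body_sketch and its parameters are made up, and the reported length is assumed to count the body in bytes plus one for the trailing newline.

```python
import io


def read_body_sketch(stream, content_length, newline_is_crlf):
    """Read one message body from a binary stream."""
    # The sender counted its trailing newline as one character, but a CRLF
    # newline occupies two bytes on the wire, so read one extra byte.
    if newline_is_crlf:
        content_length += 1
    return stream.read(content_length)


body = '{"text": "é"}'.encode('utf-8')
# Length as the sender is assumed to report it: body plus one for the newline.
reported_length = len(body) + 1

stream = io.BytesIO(body + b'\r\n' + b'next')
message = read_body_sketch(stream, reported_length, newline_is_crlf=True)
remaining = stream.read()
print(message, remaining)
```

Reading in binary mode keeps the multi-byte characters intact, and the extra byte keeps the next message aligned.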
@puremourning
Member Author

Review status: 43 of 44 files reviewed at latest revision, 1 unresolved discussion.


ycmd/tests/cs/get_completions_test.py, line 57 [r10] (raw file):
Done.


Comments from Reviewable

@coveralls

coveralls commented Apr 24, 2016

Coverage Status

Coverage increased (+15.4%) to 100.0% when pulling 7330fa0 on puremourning:unicode-investigation into 30a4524 on Valloric:master.

@Valloric
Member

Sweet Jesus, is this actually landing? :D

Awesomesauce!

@homu r+


Review status: 43 of 44 files reviewed at latest revision, 1 unresolved discussion.


Comments from Reviewable

@homu
Contributor

homu commented Apr 24, 2016

📌 Commit 7330fa0 has been approved by Valloric

@homu
Contributor

homu commented Apr 24, 2016

⚡ Test exempted - status

@homu homu merged commit 7330fa0 into ycm-core:master Apr 24, 2016
homu added a commit that referenced this pull request Apr 24, 2016
[READY] Fix issues with multi-byte characters

## Summary

This change introduces more general support for non-ASCII characters in buffers handled by YCMD.

In ycmd's public API, all offsets are byte offsets into the UTF-8 encoded buffers. We also assume (because we have no other choice) that files stored on disk are UTF-8 encoded. Internally, almost all of ycmd's functionality operates on unicode strings (Python 2 `unicode()` and Python 3 `str()` objects, transparently via `future`). Many of the downstream completion engines expect unicode code points as the offsets in their APIs. One special case is the `ycm_core` library (identifier completer and clang completer), which requires instances of the _native_ `str` type. All strings passed to the C++ code via `boost::python` must go through `ToCppStringCompatible`.

Previously, we were largely just assuming that `code point == byte offset` - i.e. all buffers contained only ASCII characters. This worked up to a point, but more by luck than judgement in a number of places.
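
To make the difference between the two kinds of offset concrete, here is a minimal sketch of converting between them for a single line. The function names are hypothetical (they are not taken from `request_wrap.py`), and the sketch assumes 1-based offsets and a byte offset that falls on a character boundary.

```python
def byte_to_codepoint_offset(line, byte_offset):
    """Convert a 1-based byte offset into the UTF-8 encoding of `line`
    into a 1-based code point offset.

    Raises UnicodeDecodeError if byte_offset points into the middle of a
    multi-byte character.
    """
    prefix = line.encode('utf-8')[:byte_offset - 1]
    return len(prefix.decode('utf-8')) + 1


def codepoint_to_byte_offset(line, codepoint_offset):
    """Convert a 1-based code point offset in `line` into a 1-based byte
    offset into its UTF-8 encoding."""
    return len(line[:codepoint_offset - 1].encode('utf-8')) + 1


line = 'na\u00efve.x'  # U+00EF takes two bytes in UTF-8

# The 'x' is the 7th code point but starts at the 8th byte.
x_as_bytes = codepoint_to_byte_offset(line, 7)             # 8
x_as_codepoints = byte_to_codepoint_offset(line, 8)        # 7

# For pure ASCII the two offsets coincide, which is why the old
# assumption appeared to work.
ascii_offset = byte_to_codepoint_offset('plain.x', 7)      # 7
```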

## References

In combination with a YCM change and PR #453, I hope this:

- fixes #109
- fixes ycm-core/YouCompleteMe#2096
- fixes ycm-core/YouCompleteMe#2088
- fixes ycm-core/YouCompleteMe#2069
- fixes ycm-core/YouCompleteMe#2066
- fixes ycm-core/YouCompleteMe#1378

## Overview of changes

The changes fall into the following areas:

- Providing access to and conversion to/from code points and byte offsets (`request_wrap.py`)
- Changing certain algorithms/features to work entirely in codepoint space when they are trying to operate on logical 'characters' within the buffer (see known issues for why this isn't perfect, but probably most of the way there)
- Changing the completers to convert between the external (on both sides) and internal representations by using the shortcuts provided in `request_wrap.py`
- Adding tests for each of the completers for both completions and subcommands

## Completer-specific notes

Pretty much all of the completers I tested required some changes:
- clang uses UTF-8 and byte offsets, but had some bugs in the `GetDoc` parsing code
- OmniSharp speaks codepoint offsets
- Tern speaks codepoint offsets
- JediHTTP speaks codepoint offsets
- tsserver speaks codepoint offsets
- gocode speaks byte offsets
- racer: I did not test

## Further work / Known issues

- we act blissfully ignorant of the case where a unicode character consumes multiple code points (such as where there is a modifier after the code point)
- when typing a unicode character, we still get an exception from `bitset` (see #453 for that fix)
- the filtering and sorting system is 100% designed for ASCII only, and it is not in the scope of this PR to change that. Currently after any filtering operation, words containing non-ASCII characters are excluded.
- I did not get round to testing rust using racer
- there are further changes required to YouCompleteMe client (a further PR is coming for that)
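
The first known issue above (one visible character made of several code points) can be illustrated with a short sketch. `count_base_characters` is a hypothetical helper, and skipping combining marks is only an approximation of user-perceived characters, not full grapheme cluster segmentation.

```python
import unicodedata


def count_base_characters(text):
    """Count code points that are not combining marks.

    This is a rough stand-in for 'logical characters'; it does not handle
    every multi-code-point sequence.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


decomposed = 'e\u0301'  # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize('NFC', decomposed)  # single code point U+00E9

decomposed_codepoints = len(decomposed)               # 2
decomposed_bytes = len(decomposed.encode('utf-8'))    # 3
composed_codepoints = len(composed)                   # 1
composed_bytes = len(composed.encode('utf-8'))        # 2
visible_characters = count_base_characters(decomposed)  # 1
```

Both strings render as the same character, yet their code point and byte offsets differ, so working in code point space still miscounts the decomposed form.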

homu added a commit to ycm-core/YouCompleteMe that referenced this pull request May 8, 2016
[READY] Fixes for multi-byte errors

# PR Prelude

Thank you for working on YCM! :)

**Please complete these steps and check these boxes (by putting an `x` inside
the brackets) _before_ filing your PR:**

- [X] I have read and understood YCM's [CONTRIBUTING][cont] document.
- [X] I have read and understood YCM's [CODE_OF_CONDUCT][code] document.
- [X] I have included tests for the changes in my PR. If not, I have included a
  rationale for why I haven't.
- [X] **I understand my PR may be closed if it becomes obvious I didn't
  actually perform all of these steps.**

# Why this change is necessary and useful

There have been a number of recent errors with unicode (most of which are caused by the server; see PR ycm-core/ycmd#455). In testing I also fixed a number of client-side tracebacks.

This is by no means a comprehensive set of fixes for the client - I have simply fixed those that I came across in testing.

Summary:
 - fixes for errors when typing in c-sharp files due to the completion done handler
 - fixes for FixIts to apply correctly with multi-byte characters
 - fixes for unicode characters in return from the omni completer

[cont]: https://github.com/Valloric/YouCompleteMe/blob/master/CONTRIBUTING.md
[code]: https://github.com/Valloric/YouCompleteMe/blob/master/CODE_OF_CONDUCT.md

@puremourning puremourning deleted the unicode-investigation branch November 1, 2016 20:43