False Positive: Source file is not valid UTF-8 #57

Closed
DaanDeMeyer opened this Issue Nov 19, 2017 · 14 comments

7 participants
@DaanDeMeyer
Contributor

DaanDeMeyer commented Nov 19, 2017

I sometimes get a 'source file is not valid UTF-8' error when editing C++ code with cquery. Haven't been able to find out how to reproduce it.

Commenting/uncommenting the line doesn't remove the error. Removing and pasting all the code in the file does remove the error.

@topisani

Contributor

topisani commented Nov 19, 2017

I get this too, but only in vsc, never in emacs

@DaanDeMeyer

Contributor

DaanDeMeyer commented Nov 19, 2017

I think I've reproduced it.

It happens whenever I type in a character such as 'à' or 'é' by accident.

@jacobdufault

Member

jacobdufault commented Nov 19, 2017

What happens if you compile the file using clang after having typed the character? Does it complain?

@DaanDeMeyer

Contributor

DaanDeMeyer commented Nov 19, 2017

Clang compiles without warnings or errors

@agauniyal

Contributor

agauniyal commented Nov 20, 2017

It depends on which version is being used to compile; it could be clang 5, whereas this plugin is using clang 4.

@topisani

Contributor

topisani commented Nov 20, 2017

It's definitely a bug: it compiles fine and, as mentioned, only happens in the VS Code client.

@MaskRay

Member

MaskRay commented Dec 8, 2017

Fixed?

@jhasse

Contributor

jhasse commented Dec 11, 2017

> Fixed?

Still happens for me when using umlauts in literals for example.

@MaskRay

Member

MaskRay commented Jan 1, 2018

Example source file?

@Riatre

Contributor

Riatre commented Jan 11, 2018

I can confirm that this still happens. How to reproduce:

  1. Open a source file with VSCode client.
  2. Type in some non-ASCII character (in my case, the full-width parenthesis "（", though anything that is not a single byte should work).
  3. cquery reports "source file is not valid UTF-8", which is unexpected, but reasonable.
  4. Delete that character. cquery still reports "source file is not valid UTF-8", which is unexpected. Reloading VS Code entirely fixes this.

My guess is it might be a bug in vscode client.

@topisani

Contributor

topisani commented Jan 11, 2018

> My guess is it might be a bug in vscode client.

Nope, it happens in emacs too. But it often works for me to just delete the line I entered a bad char on and paste it back (saving for reindex in between).

@Riatre

Contributor

Riatre commented Jan 11, 2018

Seems like this is caused by the fact that all offsets and lengths in the Language Server Protocol are given as the number of ~~UTF-8 characters~~ UTF-16 code units instead of bytes, and we treat them as bytes when updating WorkingFile.buffer_content. (src/working_files.cc:333)

Sounds difficult to fix without introducing UTF-16-aware std::string indexing code.

Edit: Nope, it's not UTF-8, it's UTF-16 as per specification. But TextDocumentContentChangeEvent.text is sent in UTF-8. (╯‵□′)╯︵┻━┻
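
To make the mismatch concrete, here is a minimal sketch of mapping an LSP column (counted in UTF-16 code units) to a byte offset in a UTF-8 line. `Utf8SequenceLength` and `Utf16ColumnToByteOffset` are hypothetical names for illustration, not cquery's actual code, and the sketch assumes the line is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from cquery): number of bytes in the UTF-8
// sequence that starts with lead byte `c`. Assumes well-formed input;
// a stray continuation byte is treated as a 1-byte sequence.
inline std::size_t Utf8SequenceLength(unsigned char c) {
  if (c < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((c & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((c & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((c & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 1;
}

// Hypothetical helper (not from cquery): convert a column given in UTF-16
// code units into a byte offset within a UTF-8 encoded line.
inline std::size_t Utf16ColumnToByteOffset(const std::string& line,
                                           std::size_t utf16_column) {
  std::size_t byte = 0;
  std::size_t units = 0;
  while (byte < line.size() && units < utf16_column) {
    std::size_t len =
        Utf8SequenceLength(static_cast<unsigned char>(line[byte]));
    // Code points encoded in 4 UTF-8 bytes need a surrogate pair in UTF-16,
    // so they count as 2 code units; everything else counts as 1.
    units += (len == 4) ? 2 : 1;
    byte += len;
  }
  // Clamp in case the last sequence was truncated.
  return byte < line.size() ? byte : line.size();
}
```

With this mapping, column 1 in the line "àx" resolves to byte offset 2 rather than 1, so an edit at that column no longer splits the two-byte "à" in half.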

@Riatre

Contributor

Riatre commented Jan 11, 2018

And what's worse, lsp-mode has no idea about these UTF-16 things, so positions coming from Emacs would be in UTF-8 characters.

Maybe we could use a UTF-8 iterator over std::string for working on buffer_content. This still breaks whenever there are 4-byte UTF-8 characters (as at this point Visual Studio Code disagrees with lsp-mode on how many "characters" there are). But being unable to insert emoji should be less annoying than having to restart cquery after accidentally typing in à or （.
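
As a rough illustration of where the two clients agree and disagree, the sketch below counts the same UTF-8 string three ways: bytes, code points (what a UTF-8 iterator would count), and UTF-16 code units (what the LSP specification uses). `TextLengths` and `MeasureUtf8` are hypothetical names, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical struct for illustration only.
struct TextLengths {
  std::size_t bytes = 0;        // std::string indexing unit
  std::size_t code_points = 0;  // what a UTF-8 iterator counts
  std::size_t utf16_units = 0;  // what LSP positions count
};

// Hypothetical helper: measure a well-formed UTF-8 string in all three units.
inline TextLengths MeasureUtf8(const std::string& s) {
  TextLengths out;
  out.bytes = s.size();
  for (unsigned char c : s) {
    if ((c & 0xC0) == 0x80) continue;  // skip continuation bytes (10xxxxxx)
    ++out.code_points;
    // A 4-byte lead (11110xxx) becomes a surrogate pair in UTF-16.
    out.utf16_units += ((c & 0xF8) == 0xF0) ? 2 : 1;
  }
  return out;
}
```

For "à" (2 bytes) the code point count and the UTF-16 count are both 1, so counting code points works for either client. For "😿" (4 bytes) the code point count is 1 but the UTF-16 count is 2, which is the case where the two clients disagree.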

MaskRay added a commit that referenced this issue Jan 13, 2018

@MaskRay MaskRay added the vscode label Jan 13, 2018

@MaskRay

Member

MaskRay commented Jan 13, 2018

Thanks @Riatre for troubleshooting. Emacs lsp-mode is good now. Don't use emojis 😿 in VSCode.
