
Brackets incorrectly identifies a file as not being UTF-8 encoded and refuses to open it #11525

Closed
aroberge opened this issue Jul 31, 2015 · 13 comments

Comments

@aroberge

I have a very simple JSON file whose entire content is

{ }

Brackets refuses to open it, claiming it is not a UTF-8 encoded file. Notepad++ identifies it as a UTF-8 encoded file (with no BOM).

@abose
Contributor

abose commented Aug 1, 2015

Could you upload the file here?

@aroberge
Author

aroberge commented Aug 1, 2015

@abose Apparently I cannot upload the file; however, here is a link to it on GitHub:
https://github.com/aroberge/reeborg/blob/master/src/worlds/empty_world.json

@petetnt
Collaborator

petetnt commented Aug 1, 2015

Did some testing: any file that

  • starts with any Unicode char from U+0061 to U+007D (a to }),
  • then has a space,
  • and ends with any Unicode char from U+0061 to U+007D (a to })

fails to be recognized as UTF-8. Adding any other characters (including newlines etc.) except spaces, or converting the file to UTF-8 with a BOM, fixes the problem.

Some test cases (take a line, remove the comment, save as anyfile.anyext as UTF-8 without BOM):

a a  // fails
a z  // fails
{ }  // fails
a }  // fails
a |  // fails
` }  // works
a `  // works
} a  // fails
} a            // fails
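
For reference, here's a quick repro harness. It's a sketch of my own (assuming a Windows build environment and linking against Advapi32; it is not actual Brackets shell code) that runs the strings above through IsTextUnicode with the statistical tests enabled, to see which ones get mistaken for UTF-16:

```cpp
// Minimal sketch (not Brackets code): feed the test strings to IsTextUnicode
// with IS_TEXT_UNICODE_UNICODE_MASK, which includes IS_TEXT_UNICODE_STATISTICS.
#include <windows.h>
#include <cstdio>
#include <cstring>

int main() {
    const char* samples[] = { "a a", "a z", "{ }", "a }", "a |", "` }", "a `", "} a" };
    for (const char* s : samples) {
        INT tests = IS_TEXT_UNICODE_UNICODE_MASK;  // in: tests to run, out: tests that passed
        BOOL unicode = IsTextUnicode(s, static_cast<int>(strlen(s)), &tests);
        // TRUE here means the heuristic thinks the bytes are UTF-16,
        // which is what makes Brackets reject the plain UTF-8 file.
        printf("\"%s\" -> %s (passed tests: 0x%X)\n",
               s, unicode ? "misdetected as UTF-16" : "ok", tests);
    }
    return 0;
}
```

As far as I understand, the exact set of strings that trip the statistical test can vary between Windows versions, so treat the output as illustrative.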

@marcelgerber
Contributor

@petetnt Thanks for looking into this very weird issue.
It looks like this is Windows-only, as it works just fine in my Linux VM.

@marcelgerber
Contributor

It's the IsTextUnicode call that returns true when it shouldn't.
Specifically, it's the IS_TEXT_UNICODE_STATISTICS flag (included in the IS_TEXT_UNICODE_UNICODE_MASK flag) that causes this issue.

According to MSDN:

The IS_TEXT_UNICODE_STATISTICS [...] tests use statistical analysis. These tests are not foolproof. The statistical tests assume certain amounts of variation between low and high bytes in a string, and some ASCII strings can slip through. For example, if lpv indicates the ASCII string 0x41, 0x0A, 0x0D, 0x1D (A\n\r^Z), the string passes the IS_TEXT_UNICODE_STATISTICS test, although failure would be preferable.

I guess we're seeing some variant of this bug.
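
To make that concrete, here's a small sketch (my own code, not the shell's) showing how the lpiResult parameter doubles as an in/out mask, which is how you can tell that it's the statistics test that fires, and that masking it out changes the verdict for the { } file:

```cpp
// Sketch: compare IsTextUnicode with and without IS_TEXT_UNICODE_STATISTICS
// for the exact 3-byte content of the reported file. Not Brackets code.
#include <windows.h>
#include <cstdio>

int main() {
    const char content[] = "{ }";  // 0x7B 0x20 0x7D, no BOM, no newline

    INT all = IS_TEXT_UNICODE_UNICODE_MASK;                   // includes STATISTICS
    BOOL withStats = IsTextUnicode(content, 3, &all);          // expected: TRUE on affected systems

    INT noStats = IS_TEXT_UNICODE_UNICODE_MASK & ~IS_TEXT_UNICODE_STATISTICS;
    BOOL withoutStats = IsTextUnicode(content, 3, &noStats);   // expected: FALSE

    printf("full mask: %d, tests that passed: 0x%X\n", withStats, all);
    printf("without IS_TEXT_UNICODE_STATISTICS: %d\n", withoutStats);
    return 0;
}
```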
Maybe @JeffryBooher @nethip?

@marcelgerber
Contributor

Btw: if Brackets saved its files with a BOM, this would not be an issue at all...
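
A BOM sidesteps the heuristic entirely, since the first three bytes already identify the file. A rough sketch of such a check (my own, not Brackets code):

```cpp
// The UTF-8 BOM is the byte sequence EF BB BF; if it's present,
// no statistical guessing is needed at all.
#include <cstddef>

bool hasUtf8Bom(const unsigned char* data, std::size_t size) {
    return size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
}
```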

@nethip
Contributor

nethip commented Aug 3, 2015

@marcelgerber Let me do some reading on this. I will get back to you. Thanks!

@petetnt
Collaborator

petetnt commented Aug 3, 2015

@marcelgerber RE: saving files with a BOM,

When I was looking into this, I noticed that UTF-8 & BOM have been under discussion for a long time in #3898 (and others such as #10583), plus there is a card on Trello too: https://trello.com/c/I5sgI4SV/1164-editor-and-bom-byte-order-mark

@nethip
Contributor

nethip commented Aug 6, 2015

@marcelgerber You were right: the IsTextUnicode Win32 call could be the reason for the failure. The thing is, detecting the encoding of a file without a BOM is a very difficult task and is not foolproof. I checked the file encoding detection code in Dreamweaver, and it is very complex.

@petetnt Thanks for trying out various steps to nail down the problem. Saving with a BOM seems like a good idea; unfortunately, it is not a recommended approach.

By the way, we have already started to think about the best ways to support various encodings. I will keep you updated on that. Thanks!

@marcelgerber
Contributor

@nethip Notepad++ has good encoding detection, and at least it isn't overly long (it's not by any means easy to comprehend, but I guess that's simply the nature of encodings):
https://github.com/notepad-plus-plus/notepad-plus-plus/blob/85c728573e0c81ff9df7a1adf4b1934fb01661e7/PowerEditor/src/Utf8_16.cpp#L182-L232 (plus an ASCII/UTF-8 distinction a few lines above).

As it's all licensed under the GPL, we could completely reuse their implementation.
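
For anyone skimming, the ASCII/UTF-8 side of that detection boils down to walking the bytes and verifying that every multi-byte sequence is well-formed. A rough sketch of the idea in my own words (not the Notepad++ code, and without its UTF-16 heuristics):

```cpp
#include <cstddef>

enum class Guess { Ascii, Utf8, NotUtf8 };

// Classify a buffer as pure ASCII, valid multi-byte UTF-8, or not UTF-8.
Guess guessEncoding(const unsigned char* p, std::size_t n) {
    bool sawMultiByte = false;
    for (std::size_t i = 0; i < n; ) {
        unsigned char c = p[i];
        std::size_t extra = 0;
        if (c < 0x80) { ++i; continue; }                   // plain ASCII byte
        else if ((c & 0xE0) == 0xC0) extra = 1;            // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2;            // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3;            // 4-byte sequence
        else return Guess::NotUtf8;                        // invalid lead byte
        if (i + extra >= n) return Guess::NotUtf8;         // truncated sequence
        for (std::size_t j = 1; j <= extra; ++j)
            if ((p[i + j] & 0xC0) != 0x80) return Guess::NotUtf8;  // bad continuation byte
        sawMultiByte = true;
        i += extra + 1;
    }
    return sawMultiByte ? Guess::Utf8 : Guess::Ascii;
}
```

Something along these lines, done the same way on every platform, would also remove the dependency on IsTextUnicode altogether.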

@petetnt
Collaborator

petetnt commented Aug 7, 2015

Can't comment on Notepad++'s method from a technical perspective, but in practice N++ has worked as a general "solve this encoding problem" workhorse for me for years. 👍

@nethip
Contributor

nethip commented Aug 7, 2015

@marcelgerber Thanks for the pointer. It looks like a good place to start this activity. Hopefully, we will be able to map Win32 calls in the repo to their Mac and Linux equivalents.

@marcelgerber
Contributor

Yes, they have a solid detection rate.

Working encoding detection is also the foundation for multi-encoding support, so it doesn't interfere with those efforts at all.
