
Brackets incorrectly identifies a file as not being UTF-8 encoded and refuses to open it #11525

Closed
aroberge opened this issue Jul 31, 2015 · 13 comments

Comments

@aroberge

I have a very simple JSON file whose entire content is

{ }

Brackets refuses to open it, claiming it is not a UTF-8 encoded file. Notepad++ identifies it as a UTF-8 encoded file (with no BOM).

@abose
Contributor

abose commented Aug 1, 2015

Could you upload the file here?

@aroberge
Author

aroberge commented Aug 1, 2015

@abose Apparently I cannot upload the file; however, here is a link to it on GitHub:
https://github.com/aroberge/reeborg/blob/master/src/worlds/empty_world.json

@petetnt
Collaborator

petetnt commented Aug 1, 2015

Did some testing: any file that

  • starts with any Unicode char from U+0061 to U+007D (a to }),
  • then has a space,
  • and ends with any Unicode char from U+0061 to U+007D (a to })

fails to be recognized as UTF-8. Adding any other characters (including newlines etc.) except spaces, or converting the file to UTF-8 with a BOM, fixes the problem.

Some test cases (take a line, remove the comment, save as anyfile.anyext as UTF-8 without BOM):

a a  // fails
a z  // fails
{ }  // fails
a }  // fails
a |  // fails
` }  // works
a `  // works
} a  // fails
} a            // fails
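
For reference, here's a quick repro harness. It's a sketch of my own (assuming a Windows build environment and linking against Advapi32; it is not actual Brackets shell code) that runs the strings above through IsTextUnicode with the statistical tests enabled, to see which ones get mistaken for UTF-16:

```cpp
// Minimal sketch (not Brackets code): feed the test strings to IsTextUnicode
// with IS_TEXT_UNICODE_UNICODE_MASK, which includes IS_TEXT_UNICODE_STATISTICS.
#include <windows.h>
#include <cstdio>
#include <cstring>

int main() {
    const char* samples[] = { "a a", "a z", "{ }", "a }", "a |", "` }", "a `", "} a" };
    for (const char* s : samples) {
        INT tests = IS_TEXT_UNICODE_UNICODE_MASK;  // in: tests to run, out: tests that passed
        BOOL unicode = IsTextUnicode(s, static_cast<int>(strlen(s)), &tests);
        // TRUE here means the heuristic thinks the bytes are UTF-16,
        // which is what makes Brackets reject the plain UTF-8 file.
        printf("\"%s\" -> %s (passed tests: 0x%X)\n",
               s, unicode ? "misdetected as UTF-16" : "ok", tests);
    }
    return 0;
}
```

As far as I understand, the exact set of strings that trip the statistical test can vary between Windows versions, so treat the output as illustrative.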

@marcelgerber
Contributor

@petetnt Thanks for looking into this very weird issue.
It looks like this is Windows-only, as it works just fine in my Linux VM.

@marcelgerber
Contributor

It's the IsTextUnicode call that returns true when it shouldn't.
Specifically, it's the IS_TEXT_UNICODE_STATISTICS flag (included in the IS_TEXT_UNICODE_UNICODE_MASK flag) that causes this issue.

According to MSDN:

The IS_TEXT_UNICODE_STATISTICS [...] tests use statistical analysis. These tests are not foolproof. The statistical tests assume certain amounts of variation between low and high bytes in a string, and some ASCII strings can slip through. For example, if lpv indicates the ASCII string 0x41, 0x0A, 0x0D, 0x1D (A\n\r^Z), the string passes the IS_TEXT_UNICODE_STATISTICS test, although failure would be preferable.

I guess we're seeing some variant of this bug.
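
To make that concrete, here's a small sketch (my own code, not the shell's) showing how the lpiResult parameter doubles as an in/out mask, which is how you can tell that it's the statistics test that fires, and that masking it out changes the verdict for the { } file:

```cpp
// Sketch: compare IsTextUnicode with and without IS_TEXT_UNICODE_STATISTICS
// for the exact 3-byte content of the reported file. Not Brackets code.
#include <windows.h>
#include <cstdio>

int main() {
    const char content[] = "{ }";  // 0x7B 0x20 0x7D, no BOM, no newline

    INT all = IS_TEXT_UNICODE_UNICODE_MASK;                   // includes STATISTICS
    BOOL withStats = IsTextUnicode(content, 3, &all);          // expected: TRUE on affected systems

    INT noStats = IS_TEXT_UNICODE_UNICODE_MASK & ~IS_TEXT_UNICODE_STATISTICS;
    BOOL withoutStats = IsTextUnicode(content, 3, &noStats);   // expected: FALSE

    printf("full mask: %d, tests that passed: 0x%X\n", withStats, all);
    printf("without IS_TEXT_UNICODE_STATISTICS: %d\n", withoutStats);
    return 0;
}
```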
Maybe @JeffryBooher @nethip?

@marcelgerber
Contributor

Btw: if Brackets saved its files with a BOM, this would not be an issue at all...
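
A BOM sidesteps the heuristic entirely, since the first three bytes already identify the file. A rough sketch of such a check (my own, not Brackets code):

```cpp
// The UTF-8 BOM is the byte sequence EF BB BF; if it's present,
// no statistical guessing is needed at all.
#include <cstddef>

bool hasUtf8Bom(const unsigned char* data, std::size_t size) {
    return size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
}
```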

@nethip
Contributor

nethip commented Aug 3, 2015

@marcelgerber Let me do some reading on this. I will get back to you. Thanks!

@petetnt
Collaborator

petetnt commented Aug 3, 2015

@marcelgerber RE: saving files with a BOM,

When I was looking into this, I noticed that UTF-8 & BOM have been under discussion for a long time in #3898 (and others such as #10583), plus there is a card on Trello too: https://trello.com/c/I5sgI4SV/1164-editor-and-bom-byte-order-mark

@nethip
Contributor

nethip commented Aug 6, 2015

@marcelgerber You were right: the IsTextUnicode Win32 call could be the reason for the failure. The thing is, detecting the encoding of a file without a BOM is a very difficult task and is not foolproof. I checked the file encoding detection code in Dreamweaver, and it is very complex.

@petetnt Thanks for trying out various steps to nail down the problem. Saving with a BOM seems like a good idea; unfortunately, it is not a recommended approach.

By the way, we have already started to think about the best ways to support various encodings. I will keep you updated on that. Thanks!

@marcelgerber
Contributor

@nethip Notepad++ has good encoding detection, and at least it isn't overly long (it's not by any means easy to comprehend, but I guess that's simply the nature of encodings):
https://github.com/notepad-plus-plus/notepad-plus-plus/blob/85c728573e0c81ff9df7a1adf4b1934fb01661e7/PowerEditor/src/Utf8_16.cpp#L182-L232 (plus an ASCII/UTF-8 distinction a few lines above).

As it's all licensed under the GPL, we could completely reuse their implementation.
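
For anyone skimming, the ASCII/UTF-8 side of that detection boils down to walking the bytes and verifying that every multi-byte sequence is well-formed. A rough sketch of the idea in my own words (not the Notepad++ code, and without its UTF-16 heuristics):

```cpp
#include <cstddef>

enum class Guess { Ascii, Utf8, NotUtf8 };

// Classify a buffer as pure ASCII, valid multi-byte UTF-8, or not UTF-8.
Guess guessEncoding(const unsigned char* p, std::size_t n) {
    bool sawMultiByte = false;
    for (std::size_t i = 0; i < n; ) {
        unsigned char c = p[i];
        std::size_t extra = 0;
        if (c < 0x80) { ++i; continue; }                   // plain ASCII byte
        else if ((c & 0xE0) == 0xC0) extra = 1;            // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2;            // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3;            // 4-byte sequence
        else return Guess::NotUtf8;                        // invalid lead byte
        if (i + extra >= n) return Guess::NotUtf8;         // truncated sequence
        for (std::size_t j = 1; j <= extra; ++j)
            if ((p[i + j] & 0xC0) != 0x80) return Guess::NotUtf8;  // bad continuation byte
        sawMultiByte = true;
        i += extra + 1;
    }
    return sawMultiByte ? Guess::Utf8 : Guess::Ascii;
}
```

Something along these lines, done the same way on every platform, would also remove the dependency on IsTextUnicode altogether.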

@petetnt
Collaborator

petetnt commented Aug 7, 2015

Can't comment on Notepad++'s method from a technical perspective, but in practice N++ has worked as a general "solve this encoding problem" workhorse for me for years. 👍

@nethip
Contributor

nethip commented Aug 7, 2015

@marcelgerber Thanks for the pointer. It looks like a good place to start this activity. Hopefully, we will be able to map Win32 calls in the repo to their Mac and Linux equivalents.

@marcelgerber
Contributor

Yes, they have a solid detection rate.

Working encoding detection is also the foundation for multi-encoding support, so it doesn't interfere with those efforts at all.
