INI file format not further specified #97

Closed
hakre opened this issue Apr 24, 2013 · 33 comments

Comments

@hakre

hakre commented Apr 24, 2013

Current documentation on the editor config file format has the following founding sentence:

EditorConfig files use an INI file format in which section names are globs matching filenames in a format accepted by the fnmatch C library.

However, the INI file format itself remains undefined. Which specification does this refer to? What does INI stand for?


@sindresorhus
Member

@hakre
Author

hakre commented Apr 24, 2013

@sindresorhus: The page you've linked can be edited by anybody, so I have trouble seeing how its many authors can speak for EditorConfig. Apart from that, it says that INI is an informal standard and that it has different implementations. E.g. do you share the specifics about case sensitivity that are outlined on the Wikipedia page? As you can see, pointing over there is not very helpful for learning the specifics of this EditorConfig INI file format.

Another quote from that page:

The INI file format is not well defined

@treyhunner
Member

@hakre yes the term "INI" is undefined. The term is loose, but I think it conveys the general concept fairly well for a user of .editorconfig files, though not necessarily well enough for a developer.

I do think we should define a formal grammar for the EditorConfig file format. Currently our tests check some grammar features among the various parsers, but they're certainly not nearly as complete (nor as provably correct) as a grammar.

Here's a relevant mailing list thread: https://groups.google.com/forum/?fromgroups=#!searchin/editorconfig/grammar/editorconfig/ZBd9S4uGsWQ/lBMPR5smE9wJ. I just made an issue for this also: gh-98.

@xuhdev
Member

xuhdev commented Apr 25, 2013

FYI, the current ini parser is based on inih with slight modification.

@hakre
Author

hakre commented Apr 25, 2013

@treyhunner Sorry, but if you address something - and that's how I understand it - to people who have communication problems over tabs vs. spaces, well, do you really think you can get very far without a grammar at all? :D Thanks for bringing this up not only in the Google group but also with an issue ticket.

For me as a developer who needs to decide whether to distribute .editorconfig or not - and sorry, just placing "helpful" and "awesome" on some website is not enough for me - one of the first things I check is the specs and how well thought out they seem to me. Not having a grammar and being imprecise about the file encoding is a bummer in my eyes.

I have tried to address these problems over the last few hours; however, I see that this project is still at a very alpha stage, so I think a dead-serious review would not be fair.

Have you made any plans to announce a stable version and report errata from there on or something like a more formal procedure?

Also what was the reason to not formulate a grammar in the first place? Can you share? I ask because some projects might find this intimidating which is not what I aim for.

I know that creating the grammar is a lot of work (probably more than getting some software to run) and needs much thought from one or more dedicated people (the grammar is only the result). On the other hand, this project decided to opt for an INI format, so I would expect it to know the ground it is playing on, and I have to wonder why both points - a formal description of the format and the file encoding - appear to have been given so little thought.

Also, from the work done here that I could see in the short time I have been looking, there is more to it than formulating a file format: you also have to find terms and types to describe the different editor configuration options. I would further suggest separating this process from the plain INI file grammar so that it is possible to express a configuration in a different file format as well, e.g. to embed it with RDF. I can imagine that this will be helpful for your project regardless of the problems with the file format. That kind of foundation would never be lost.

Also, if you define the terms and values as case-insensitive (maybe in the style in which CSS defines case-insensitivity), there should not be many problems with the implementations (e.g. file matching on a Windows system is case-insensitive, section and property names in INI files are case-insensitive, etc.).

Just my 2 cents.

@treyhunner
Member

Have you made any plans to announce a stable version and report errata from there on or something like a more formal procedure?

We wanted to specify a grammar, ensure the core libraries agree on the file format, and possibly solidify support for some more widely-used editors first. We'd then release a 1.0 and call it stable.

My focus has drifted from this project over the past year. I don't spend nearly as much time working on this project as I used to. I'd appreciate any help we can get to make a good 1.0 release.

Also what was the reason to not formulate a grammar in the first place? Can you share? I ask because some projects might find this intimidating which is not what I aim for.

As you've already said it's a lot of work. I made this project originally to see if it could even work. @xuhdev and I were the only developers (and users) for months until @sindresorhus and @swansontec and then many others jumped in to help in various ways. No one is getting paid to do this so I assume everyone naturally works on a mix of what they think is important and what they enjoy (same way most open source projects work).

I eventually took a stab at writing a grammar as you've seen already, but it's the first grammar I've written and I currently lack the ability to validate and fix it.

You're the first to ask for a grammar. Would you care to help us on this front? I assume you are more knowledgeable about creating formal grammar specs than I am.

@treyhunner
Member

To address your other points...

I would further suggest separating this process from the plain INI file grammar so that it is possible to express a configuration in a different file format as well, e.g. to embed it with RDF. I can imagine that this will be helpful for your project regardless of the problems with the file format. That kind of foundation would never be lost.

Since the format is mostly just key/value pairs under sections, it would be trivial to represent the same data using XML, JSON, or many other formats. I've never dealt with RDF before, so I'm not certain what you mean though.

Also if you define the terms and values case-insensitive (maybe in the style how CSS defines case-insensitivity), there should not be many problems with the implementations (e.g. file matching on a windows system is case insensitive, section and property-names in INI files are case insensitive etc.).

Since filenames tend to be case-sensitive on many other operating systems I actually think we should favor case sensitivity. Otherwise it would be impossible to create a glob matching test.sh but not Test.sh.
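
To illustrate the point, here is a minimal sketch that uses Python's standard `fnmatch` module as a stand-in for the glob matcher (it is not the EditorConfig core, and the `matches` helper is hypothetical). With case-sensitive matching the two names can be told apart; once case is folded they cannot:

```python
from fnmatch import fnmatchcase


def matches(pattern, filename, case_sensitive=True):
    """Hypothetical helper: glob-match with or without case folding."""
    if not case_sensitive:
        # Assumption: simple lowercasing is good enough for this demo.
        pattern, filename = pattern.lower(), filename.lower()
    return fnmatchcase(filename, pattern)


print(matches("test.sh", "test.sh"))
print(matches("test.sh", "Test.sh"))
print(matches("test.sh", "Test.sh", case_sensitive=False))
```

The second call is the only one that rejects the name, which is exactly the distinction a case-insensitive format could no longer express.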

I don't often use Windows so I don't think I'm in a good place to make this decision though. I would like to hear what other Windows users think about this issue.

@xuhdev
Member

xuhdev commented Apr 25, 2013

@hakre I think you bring up the syntax because you lean towards seeing EditorConfig as a language, while we still see it at the configuration-file level. Yes, you're right: since INI is such a loose term, we should have defined it somewhere the public could find it. But for most people it's enough to tell them it is an INI file. There are very few differences between the INI variants, and that's probably why no one has complained about the syntax. And yes, internally we do have an agreement on the syntax; see my comment on #98. You can also find test cases for the file format here, although they are not complete.

And for your previous comment:

The page you've linked can be edited by anybody, so I have trouble seeing how its many authors can speak for EditorConfig. Apart from that, it says that INI is an informal standard and that it has different implementations. E.g. do you share the specifics about case sensitivity that are outlined on the Wikipedia page? As you can see, pointing over there is not very helpful for learning the specifics of this EditorConfig INI file format.

First, Wikipedia has references on most pages. Follow them if you need precise information from a solid source.

Secondly, again, sorry: the case sensitivity is currently only published here. I'll copy this important information to the homepage.

@hakre
Author

hakre commented Apr 26, 2013

@treyhunner: Sounds fair, here is what I can throw into the ring:

  • Write my own implementation to find out which issues I see.
  • Suggest a grammar formulation based on it.
  • You should take care of the dictionary of the key -> value pairs the standard has; I suggest keeping this apart from the INI file format.

For the implementation I would focus as much as possible on interoperability, so that all INI usage that is common right now remains possible. To keep this from becoming too unspecific, there is a suggested ("should") way of writing and encoding the EditorConfig file. The parser tries to recover from problems in the file, e.g. missing or different syntax elements or look-alike characters; however, all these cases are highlighted so that a decision can be made about whether to support them and which error or warning to give. Again, I would suggest supporting as much as possible while giving notices, warnings and parse errors. A well-written INI file should then not produce any notices, but as long as the parser does not report any errors, the file was at least parsed.

@xuhdev: What you say about the references on Wikipedia is right; in my defense, however, I have to point out that none of these reference links were presented to me, only the Wikipedia page itself, and I guess we are all old enough here to know that Wikipedia itself is not, cannot be, and does not want to be a solid source. It is a living encyclopedia edited by millions. And in this case it reminded me that there is a problem with the file format, so it has some use, especially in bringing the problem of the INI file format to light.

One solution I see for dealing with the definition problems of the INI file format is interoperability. From the various discussions I could also see that at least some folks here consider this a way to approach the format problem: for example, suggesting one specific way to write the file while being more generous in parsing an input file. Like "Be conservative in what you send, be liberal in what you accept" in ROBUSTNESS.

Also, I do not think the only problem so far is that the documentation exists only internally. Sure, this can be a facet of the problem, but in my eyes it is a symptom rather than the actual root cause. We cannot remove the problems by just saying this was a documentation problem if it wasn't (only) that.

I'll write some code up and see what else comes to mind. And the problem with case sensitivity is much larger than not being fully outlined in the docs; see the discussion about the case sensitivity of the section keys. Right now I have no clue how to keep that interoperable because, as @treyhunner pointed out, file names are case-sensitive on Unix-like systems but not on Windows. Speaking of interoperability, one strategy here could be to use the least common denominator, that is, to make those not case-sensitive. I can understand that Linux or Mac OS X users might not prefer that, but in the software development projects I know, this has never been a cause of problems for a file-matching pattern, simply because there is normally no need for case sensitivity in those glob patterns (prove me wrong here, please). All I can ask is that you review this yourself for the code bases you know. If you could live with the INI format so far, I would say it is even less burdensome to live with case-insensitive patterns than to continue with the current status quo of the nonexistent grammar/encoding.

Here again, a special mode of operation could give notices if a pattern does not match the case.

Also, by case-insensitivity I have US-ASCII case-insensitivity in mind, i.e. only [a-z] and [A-Z] are equivalent. So no case information from the Unicode repository, such as case folding and special casing. Compare with case-insensitivity in CSS: W3-CSS-CASE; CSS21-CHARACTERS. This should cover the real-life file/pattern cases well, as well as the key and value strings that are used by EditorConfig.

Sorry for text-walling, tried to outline my thoughts, I'm sure it still has some rough edges but I hope this feedback is useful to base some work upon.

@treyhunner
Member

Thank you for offering to help us out @hakre. I should be able to handle the dictionary of key/value pair definitions. I think we may want to manage two lists: one with very simple specifications that are most easily understood for the average end user and one with more rigorous descriptions. The first one is already on the website (though it is probably in need of rewording) and the second hasn't been made yet.

For the implementation I would focus as much as possible on interoperability, so that all INI usage that is common right now remains possible. To keep this from becoming too unspecific, there is a suggested ("should") way of writing and encoding the EditorConfig file.

I usually prefer "only one way to do it" but I think this approach is a good idea in this case. @sindresorhus expressed similar sentiment in mailing list discussions we've had before. I have been using ; for comments in INI files (strangely I guess since I usually use Python) and = for assignment and some others prefer # for comments and : for assignment.

One solution I see for dealing with the definition problems of the INI file format is interoperability. From the various discussions I could also see that at least some folks here consider this a way to approach the format problem: for example, suggesting one specific way to write the file while being more generous in parsing an input file. Like "Be conservative in what you send, be liberal in what you accept" in ROBUSTNESS.

I like the way CSS recovers from errors. I think this might be a good idea for EditorConfig, since a missing = sign or ] under one section shouldn't need to affect another section. Not all tools that parse CSS recover in the same way a browser does (some minifiers just break on certain errors), but that's fine: while it's important that the browser tries its best to keep going, a developer can spot a problem early on with tools that are stricter about the format.

Also I do not think that the only problem so far is to have the documentation internally only. Sure this can be a facet of the problem, but this is more showing something than the actual root-cause in my eyes.

I agree. There are definitely subtle inconsistencies in some areas where documentation is missing and possibly less subtle bugs. Hopefully this process will help us identify and fix or at least document these problems.

Speaking of interoperability, one strategy here could be to use the least common denominator, that is, to make those not case-sensitive. I can understand that Linux or Mac OS X users might not prefer that, but in the software development projects I know, this has never been a cause of problems for a file-matching pattern, simply because there is normally no need for case sensitivity in those glob patterns (prove me wrong here, please).

I've made issue #100 so it can be discussed openly. I may post a message on Twitter asking others to chime in with their thoughts. Feel free to repost your thoughts on the issue I just created: #100.

@sindresorhus
Member

I previously suggested adding support for YAML, as INI was clearly ambiguous.

@hakre
Author

hakre commented May 3, 2013

@treyhunner: can you give some pointers to what has been said (if anything) about duplicate fnmatch patterns? Should sections with the same pattern be merged? You wrote something about a CSS-like fallback, so I would say the last rule wins and, yes, identical sections merge, with the last property value winning.

[*.txt]
propertya = value1
propertyb = value1

[*]
propertya = value2

[*.txt]
propertya = value3

(what happens with all these values?)

@hakre hakre closed this as completed May 3, 2013
@hakre hakre reopened this May 3, 2013
@xuhdev
Member

xuhdev commented May 3, 2013

For duplicated properties the last ones win. Others merge.

@treyhunner
Member

@hakre basically the last one wins. So in that case something.txt would give you:

propertya = value3
propertyb = value1

If you removed the last section propertya would be value2 instead.

This is how section parsing currently works:

  1. .editorconfig files are read from the root of the directory tree downward
  2. Each file is read from top to bottom
  3. The properties from every section that matches the file should be used
  4. The values for duplicate property names should come from the last read value
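
The four rules above can be sketched in a few lines. This is only an illustration, not the core library's code: the `resolve` helper and its list-of-pairs input are assumptions, and Python's `fnmatch` stands in for the real glob matcher. The sections are taken from the example earlier in this thread.

```python
from fnmatch import fnmatchcase


def resolve(sections, filename):
    """Merge the properties of every matching section; later values win.

    `sections` is a list of (pattern, {property: value}) pairs in the
    order they were read (outermost file first, each file top to bottom).
    """
    result = {}
    for pattern, props in sections:
        if fnmatchcase(filename, pattern):
            result.update(props)  # a later duplicate overwrites an earlier one
    return result


sections = [
    ("*.txt", {"propertya": "value1", "propertyb": "value1"}),
    ("*", {"propertya": "value2"}),
    ("*.txt", {"propertya": "value3"}),
]
print(resolve(sections, "something.txt"))
print(resolve(sections[:-1], "something.txt"))  # last section removed
```

The first call yields propertya = value3 and propertyb = value1; with the last section removed, propertya falls back to value2.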

@hakre
Author

hakre commented May 6, 2013

After chewing on this for the last week, I see the following picture for specifying a distinct format and parsing of the EditorConfig INI file.

From a bird's-eye view I see the following three parts for the file:

  • Bytestream Normalization
  • File-Format
  • Parser

This is only for the file-format.

My considerations are made with interoperability in mind. Because of the division into different parts, it is easy to modify the parameters to tailor the format if a specific flavor is preferred (e.g. supported line separators, allowed comment characters, etc.).

One aspect that has not yet been part of the discussion is how editor plugins must behave when a user uses the editor to save an .editorconfig file to disk. Because this is an interactive process (the common eat-your-own-dogfood situation), any warnings and notices that the file buffer to be saved would raise (e.g. invalid or ambiguous syntax) need to be displayed to the user who saves the file. The display offers two options for how to proceed: cancel saving the file (the default) so that the warnings can be fixed first, or continue to save the file despite the errors.

This has multiple reasons. First of all, implementors of the plugins are the best multipliers. They give the best feedback. They need to provide the best user interface so that users can give the best feedback on the EditorConfig standard (the ideal form of bug reporting). Also, users should be interactively trained and informed about the errors they made. We cannot provide individual support to each user, so the software should be a good tool that assists the user in doing the job, and what better way is there to teach users how to save an .editorconfig file? Second, EditorConfig is about sharing and exchanging, and a major part of that is not only reading but also writing. Therefore plugins should not limit themselves to reading the EditorConfig file but should also handle writing it. Holistically.

Bytestream Normalization

This is one of the later ideas I had, so it might still be a little rough. It was born out of the idea of making UTF-8 the default file encoding that any EditorConfig implementation must support. So I imagined the simplest possible UTF-8-based parser, and to achieve this I imagined one that can safely drop certain bytes without altering the meaning of the file:

  • All bytes higher than \x7F / 127
  • All bytes lower than \x20 / 32 excluding:
    • \x0A / 10 as line-separator

For this to work, a bytestream normalization can take place before handing the data to the parser. That means that if an implementation wants to support different line separators, they only need to be changed into \x0A ("\n"). This is the case for data that comes from systems like Windows, or for files encoded as Unicode that use NEL U+0085 Next Line, LS U+2028 Line Separator or PS U+2029 Paragraph Separator.

So the bytestream normalization works as follows:

  • At (before) the beginning of the input, the four octets "\x0A[]\x0A" are added as initial line-separator and selector. Next.
  • At EOF, \x0A is added as terminating line-separator. Done.
  • Line-separators are normalized (e.g. replace \x0D with \x0A; I suggest replacing \x0B-\x0D with \x0A). Next.
  • The line-separator \x0A is taken over. Next.
  • All characters from \x20 to \x7F are taken over. Next.
  • Everything else is dropped. Next.

Dropped means the octet is removed from the stream, so this normalization can work as a filter. It can be prefixed with another filter that deals with a more specific encoding. I'm not that strong with non-Unicode multi-byte charsets, such as those for Asian languages (e.g. GB18030). They might be at a disadvantage on the default route of parsing at the octet level; however, thanks to UTF-8 support, a common international encoding is already supported by default (this choice is mostly due to my own cultural background, so review from other backgrounds is highly appreciated). All single-byte charsets that share the US-ASCII base work out of the box. Same for UTF-8. To a certain extent UTF-16 and UTF-32 might even work because we drop the NUL bytes. UTF-8 BOMs are no problem; they are removed automatically at the byte level already.
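
A minimal sketch of the filter described above, assuming the suggested \x0B-\x0D replacement is used. The `normalize` name is hypothetical, and the optional pre-filter for other encodings (including the Unicode separators NEL, LS and PS) is deliberately left out:

```python
def normalize(data: bytes) -> bytes:
    """Sketch of the proposed byte-stream normalization (hypothetical helper)."""
    out = bytearray(b"\n[]\n")        # initial line-separator plus empty selector
    for byte in data:
        if 0x0B <= byte <= 0x0D:      # VT, FF and CR become the line-separator
            out.append(0x0A)
        elif byte == 0x0A or 0x20 <= byte <= 0x7F:
            out.append(byte)          # LF and \x20-\x7F are taken over
        # every other octet (e.g. TAB, NUL, bytes above \x7F) is dropped
    out.append(0x0A)                  # terminating line-separator at EOF
    return bytes(out)


# A UTF-8 BOM followed by two CRLF-terminated lines.
print(normalize(b"\xef\xbb\xbfroot = true\r\n[*]\r\n"))
```

Note that under this sketch a Windows CRLF turns into two line-separators. The extra empty line is harmless because empty lines are dropped by the file format anyway.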

So in the end, this byte-stream normalization largely reduces the complexity of the parser and therefore of the format, while remaining interoperable. At least that is the idea.

Also it is easy to support different encodings because they only need to output a normalized bytestream - the same parser can operate on them.

Case-Insensitivity

If the case-insensitivity rules apply to the whole file (including the selectors for files, see below), case could already be normalized in the byte-stream, e.g. by making everything uppercase. This could simplify the parser further. Otherwise, case-insensitivity is applied only to PROPERTY and VALUE.

For case-insensitivity I still suggest that A-Z is equal to a-z and that's it. No further rules apply.
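
As a small sketch of what "A-Z is equal to a-z and that's it" means in practice (the helper names are made up for this example), an ASCII-only fold leaves every non-ASCII character untouched, unlike Unicode-aware lowercasing:

```python
import string

# Translation table that maps only A-Z to a-z; nothing else is changed.
_ASCII_FOLD = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_fold(text: str) -> str:
    """Fold ASCII letters to lowercase, leaving all other characters alone."""
    return text.translate(_ASCII_FOLD)


def ascii_equal(a: str, b: str) -> bool:
    """Compare two strings under US-ASCII case-insensitivity."""
    return ascii_fold(a) == ascii_fold(b)


print(ascii_equal("Indent_Style", "INDENT_STYLE"))
print(ascii_equal("É", "é"))
```

The first comparison succeeds; the second does not, because É and é lie outside the ASCII range and are therefore never considered equivalent.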

File-Format

The file format is a line-based format. Each line is separated from the next line by \x0A enclosed by as many space characters \x20 as possible. This is extremely easy to handle because the initial byte-stream normalization adds one separator at the beginning and one at the end. This effectively removes the whitespace at the beginning and end of each line.

Each line is exactly one of the following five types:

  1. empty - an empty line is dropped.
  2. comment - a comment is a line with the comment character (";" / \x3B / 59 or another one like "#" / \x23 / 35) at offset 0; it is dropped.
  3. selector - a selector is a line with "[" / \x5b / 91 at offset 0. It is optionally terminated with "]" / \x5D / 93 at the last offset or the end of the line, whichever comes first. All characters in between constitute a selector that applies to the following lines until the next selector appears.
    1. SELECTOR - no formal range of allowed bytes yet. A selector can be of zero length, matching anything.
  4. property/value pair - a more complex sequence of characters: a PROPERTY separated from the VALUE by "=" / \x3D / 61 (or optionally ":" / \x3A / 58), which is enclosed by as many space characters \x20 as possible.
    1. PROPERTY is a sequence of characters A-Za-z*
    2. VALUE is a sequence of any characters up to the end of the line or the comment character (see above), whichever comes first.
  5. invalid - a line that matches none of the types above, checked in the order of this list, is of type invalid and is treated like an empty line: it is dropped.
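
The five line types can be sketched as a small classifier. This is only an illustration of the rules as listed, not a reference implementation; the `classify` name is hypothetical, and allowing "_" in PROPERTY is an assumption made here so that names such as indent_style are accepted.

```python
import re

COMMENT_CHARS = ";#"
# Assumption: "_" is allowed in PROPERTY so that names like indent_style match.
PAIR = re.compile(r"([A-Za-z_]+) *[=:] *(.*)")


def classify(line: str) -> tuple:
    """Return the type of one line plus its parsed parts (hypothetical helper)."""
    line = line.strip(" ")
    if not line:
        return ("empty",)
    if line[0] in COMMENT_CHARS:
        return ("comment",)
    if line[0] == "[":
        body = line[1:]
        if body.endswith("]"):        # the closing bracket is optional
            body = body[:-1]
        return ("selector", body)
    match = PAIR.fullmatch(line)
    if match:
        # VALUE ends at the first comment character, if there is one.
        value = re.split("[;#]", match.group(2), maxsplit=1)[0].rstrip(" ")
        return ("pair", match.group(1), value)
    return ("invalid",)


for sample in ["", "; note", "[*.py]", "[*.py", "indent_size : 4 # four", "???"]:
    print(repr(sample), classify(sample))
```

Checking the types in list order is what makes "invalid" a safe catch-all: anything that reaches the end of the function is simply dropped, like an empty line.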

Parser

The parser needs to map a filename/path via selectors to property/value pairs. This implies parsing/comparison rules for the selector itself. I do not define those rules in this memo. However, an empty file-name by definition matches any selector including the empty one.

To solve the problem of matching nothing (e.g. retrieving the root property, which comes before the first selector), the default selector is an empty one. It might be worth injecting it into the byte-stream directly after the first \n that is prepended anyway, like "\n[]\n". The alternative is to make the parser more complex.

If a selector matches, all lines it applies to (the sequence of following property/value pairs until the next selector) are used, with each property set to its value. If a property has already been set to a value, the later value overwrites the previous one. If a property name is undefined, it is not taken into account. If a value is invalid for the property name, the property is as well not taken into account.

Because multiple selectors can match the filename all selectors need to be compared in the order they appear in the file.

Selector Pattern

The meaning of the pattern language of the selectors has yet to be defined. Most of that work has been done already. Questions of case sensitivity have been raised already.

Thanks to the byte-stream normalization, filename encoding problems have already been moved out of the spec. What's missing here is the backdoor to bring back characters that are outside the range of the filtered characters. Percent (URL) encoding came to mind for those who really need it, or, probably more straightforward, \xFF octet and \uFFFF Unicode escapes. Selectors already have escaping by backslash so that "[" can be used as part of the file name and not as part of the pattern (character class).

Square brackets as part of the pattern (character class) can be (more or less) safely used because the ending bracket is optional [this might change as formulating the tokenizer/grammar reveals]. Using it does allow to use any number of opening and closing square brackets inside the selector without further escaping them [with look-ahead line-based parsing this works easily, with stateless tokenization this might introduce some probs].

Fallbacks / Warnings

Depending on which characters are supported, a parser might want to issue warnings and notices, for example if an invalid line is found. Or one might want to leave a notice if a line of type selector is not terminated by "]". These warning cases should be defined.

Some standard warning cases are:

  • Invalid PROPERTY name. There is a defined set of valid property names.
  • Invalid VALUE. There is a defined set of valid values per each property.

Parse Errors

If an implementation is not able to recover from an input, e.g. because re-encoding the input to UTF-8 before bytestream normalization is not possible, a parse error should be raised and parsing fails.

Validation

Because of this simple set of rules, validation can be done either while parsing or in advance. A validator can therefore just run a parse against an empty file name/path, which by definition matches all selectors. Therefore all property/value lines are parsed.


So what does this give us? What is outlined above, minus the questions still open for selector patterns, should allow straightforward parsing in an interoperable manner. Thanks to the byte-stream normalization the grammar is pretty simple. The tokenization might be a bit more complex because of the many properties and values, if those are broken down to the tokenization level. This is work in progress; in the following variant I started to write it down, mainly to get a feeling for the formalization:

num         0|[1-9][0-9]*
char        [_a-zA-Z]
nmchar      [_a-zA-Z0-9]
sbo         [\[]
sbc         [\]]
sboc        [\[\]]
other       [!"\x24-\x2F<>?@\\\x5E-\x60{|}~\x7F]
nwchar      {nmchar}|{bkchar}|{other}
string      {nwchar}+([ ]+{nwchar})*
cmchar      [#;]
comment     {cmchar}{string}
aschar      [=:]
us          [_]
s           [ ]+
w           {s}?
nl          \n
ls          {s}*{nl}{s}*

A           [Aa]
B           [Bb]
C           [Cc]
D           [Dd]
E           [Ee]
F           [Ff]
G           [Gg]
H           [Hh]
I           [Ii]
L           [Ll]
M           [Mm]
N           [Nn]
P           [Pp]
R           [Rr]
S           [Ss]
T           [Tt]
U           [Uu]
O           [Oo]
W           [Ww]
Y           [Yy]
Z           [Zz]

charset     {L}{A}{T}{I}{N}"1"|{U}{T}{F}"-"(8("-"{B}{O}{M})?|16({B}|{L}){E})

FALSE       {F}{A}{L}{S}{E}
INDENT      {I}{N}{D}{E}{N}{T}
SPACE       {S}{P}{A}{C}{E}
TAB         {T}{A}{B}
LF          {L}{F}
LATIN1      


P_ROOT              {R}{O}{O}{T}
P_INDENT_STYLE      {INDENT}{us}{S}{T}{Y}{L}{E}
P_INDENT_SIZE       {INDENT}{us}{S}{I}{Z}{E}
P_TAB_WIDTH         {TAB}{us}{W}{I}{D}{T}{H}
P_END_OF_LINE       {E}{N}{D}{us}{O}{F}{us}{L}{I}{N}{E}
P_CHARSET           {C}{H}{A}{R}{S}{E}{T}
P_TRIM_TRAILING_WHITESPACE  {T}{R}{I}{M}{us}{T}{R}{A}{I}{L}{I}{N}{G}{us}{W}{H}{I}{T}{E}{S}{P}{A}{C}{E}
P_INSERT_FINAL_NEWLINE      {I}{N}{S}{E}{R}{T}{us}{F}{I}{N}{A}{L}{us}{N}{E}{W}{L}{I}{N}{E}

P_ANY               {P_ROOT}|{P_INDENT_STYLE}|{P_INDENT_SIZE}|{P_TAB_WIDTH}|{P_END_OF_LINE}|{P_CHARSET}|{P_TRIM_TRAILING_WHITESPACE}|{P_INSERT_FINAL_NEWLINE}

VT_TRUE             {T}{R}{U}{E}
VT_BOOL             {VT_TRUE}|{FALSE}
VT_INDENT           {TAB}|{SPACE}
VT_LS               {C}{R}{LF}|{LF}
VT_CHARSET          {C}{H}{A}{R}{S}{E}{T}

VT_ANY              {VT_BOOL}|{VT_INDENT}|{VT_LS}|{VT_CHARSET}

%%

{comment}   {return COMMENT}
{kv}        
{ls}        {return LINE_SEPARATOR}

As with the spec above, this tokenization does not (yet) cover the selector tokens (wildcard patterns) well. Selector tokens should become part of it, including the escaping they have (see http://editorconfig.org/).

Also this is blackboxing, I have not yet read the mailing-list posting. After reading it, this might change a lot, just saying.


Example file:

; EditorConfig <http://EditorConfig.org>

; top-most EditorConfig file
root = true

; Unix-style newlines with a newline ending every file
[*]
end_of_line = lf
insert_final_newline = true

; 4 space indentation
[*.py]
indent_style = space
indent_size = 4

; Tab indentation (no size specified)
[*.js]
indent_style = tab

; Indentation override for all JS under lib directory
[lib/**.js]
indent_style = space
indent_size = 2

@treyhunner
Member

I like the idea of giving feedback to a user as they save a .editorconfig file so they know if something is malformed or even just non-standard. This will be even more important as the format becomes more error-tolerant.

In general we need to provide more suggestions to plugin developers for how to improve the user experience in their plugins. I think adding an option that shows users which .editorconfig files are being read and which properties are being applied could help users who are new to the format and might otherwise become frustrated by a feature they don't understand.

I don't have strong opinions about the bytestream normalization. My internationalization experience so far has all been web-centric (where the answer is always "just support UTF-8").

Normalization seems like it would only be a problem if we later need to allow characters from the ignored character set. For example, CSV parsers often support specifying a separator string, which could be a non-standard UTF-8 character. I can't think of a realistic example of this for EditorConfig, so it might not be a problem.

Why is this UTF-8 issue not a problem for other settings files (.gitignore, .hgignore, .jslintrc, etc.)? It seems like allowing non-ASCII UTF-8 characters for matching filenames would be easier than requiring use of \x. I deal primarily with ASCII-friendly files so pardon my naivety of this problem.

The parser needs to map a filename/path via selectors to property/value pairs. This implies parsing/comparison rules for the selector itself. I do not define those rules in this memo. However, an empty file-name by definition matches any selector including the empty one.

So far the only property we have that is meaningful outside of any section is the root = true property. This property isn't meaningful for any one section and actually refers to the .editorconfig file itself (which is the highest EditorConfig file that should be parsed in that case) and not the file being matched.
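
One way to read this: `root = true` limits how far up the directory tree the search for .editorconfig files goes. A rough sketch under that reading follows; `config_chain` and the `is_root` callback are hypothetical names, not part of any EditorConfig core, and the callback stands in for "this file exists and declares `root = true` before its first section".

```python
from pathlib import PurePosixPath


def config_chain(file_path, is_root):
    """List candidate .editorconfig paths from the file's directory upward.

    The walk stops after the first candidate for which is_root() is true.
    """
    chain = []
    for directory in PurePosixPath(file_path).parents:
        candidate = str(directory / ".editorconfig")
        chain.append(candidate)
        if is_root(candidate):
            break
    return chain
```

For `/home/u/proj/lib/a.js` with a root file at `/home/u/proj/.editorconfig`, the chain would stop at the project directory and never look at `/home/u` or above.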

If all files need to be matched a [*] section can be used. This is already commonly used in many .editorconfig files I've seen in the wild. I don't think the top-level default section should be meaningful. If we later need it to be meaningful for a special property (like editorconfig_version = 2 or some other option the plugins may need to know) then we could add support at that time. I can't think of a way backwards-compatibility would matter in that case.

If a property name is undefined, it is not taken into account. If a value is invalid for the property name, the property is as well not taken into account.

We actually continue to pass invalid property values to the plugins. We do so existing properties can be extended with more values later (utf-32 charset could be added for example) and so files can ignore a specific property by setting an invalid value (like end_of_line = mixed when a group of files can't be standardized properly).
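
On the plugin side, this pass-through could be handled roughly as follows. It is a sketch only: `resolve_end_of_line` and the lookup table are illustrative names, and the assumption is that a plugin simply does nothing for a value it does not recognize.

```python
# Values a hypothetical plugin knows how to apply.
KNOWN_END_OF_LINE = {"lf": "\n", "crlf": "\r\n", "cr": "\r"}


def resolve_end_of_line(props):
    """Return the newline string to enforce, or None to leave the file alone.

    An unrecognized value such as "mixed" is not an error; it just means
    the plugin takes no action for this property.
    """
    value = props.get("end_of_line", "").lower()
    return KNOWN_END_OF_LINE.get(value)
```

With this shape, `end_of_line = mixed` reaches the plugin unchanged and results in no line-ending enforcement, which is the "ignore this property" use described above.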

Because multiple selectors can match the filename all selectors need to be compared in the order they appear in the file.
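
A minimal sketch of that ordering rule, assuming that later matching sections override earlier ones. Python's `fnmatch.fnmatchcase` is used here purely as a stand-in; the real EditorConfig glob semantics may differ, and `properties_for` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase


def properties_for(path, sections):
    """Merge properties from every matching section, in file order.

    `sections` is a list of (glob, properties) pairs in the order they
    appear in the file; a later match overwrites an earlier one.
    """
    merged = {}
    for glob, props in sections:
        if fnmatchcase(path, glob):
            merged.update(props)
    return merged


# Sections taken from the example file earlier in this thread.
SECTIONS = [
    ("*", {"end_of_line": "lf", "insert_final_newline": "true"}),
    ("*.js", {"indent_style": "tab"}),
    ("lib/**.js", {"indent_style": "space", "indent_size": "2"}),
]
```

Under these assumptions, `properties_for("lib/app.js", SECTIONS)` picks up `indent_style = space` from the last section, while `properties_for("main.js", SECTIONS)` keeps `indent_style = tab`.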

@xuhdev
Member

xuhdev commented May 7, 2013

I also didn't see the point of dropping all bytes higher than \x7F / 127. As in some countries, naming the directories by their own languages is quite common among developers. As far as in my case, it works pretty well.

We currently have different meanings for options before any sections. They are global options. I don't see any point in prepending []. [] might be used in the future, so why would we use this slot now?

For the "invalid" properties, I would rather say they are "not defined yet". They should be treated as legal lines, since editors could extend some by themselves. Such as, jEdit plugin supports jedit_charset to support the very large set of charsets that jEdit supports. For projects that all people using jEdit, this is a good deal. Another reason is that when a user has an .editorconfig file containing both new properties, for example, from 0.10.0 and 0.9.0, users of both 0.10.0 and 0.9.0 can work with this file well (i.e. no parsing error for 0.9.0).

But you are right, we could give some # [Warning] information to stderr when needed.
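
Something along these lines, perhaps: unknown names are reported on stderr but otherwise left untouched. This is a sketch only; the `KNOWN` set is an illustrative list rather than an authoritative one, and `check_properties` is a made-up name.

```python
import sys

# Illustrative list of property names; assumed for this example only.
KNOWN = {
    "root",
    "indent_style",
    "indent_size",
    "tab_width",
    "end_of_line",
    "charset",
    "trim_trailing_whitespace",
    "insert_final_newline",
}


def check_properties(props, stream=sys.stderr):
    """Warn about names outside KNOWN and return them, without dropping any."""
    unknown = sorted(name for name in props if name not in KNOWN)
    for name in unknown:
        print(f"# [Warning] property not defined yet: {name}", file=stream)
    return unknown
```

A property such as `jedit_charset` would trigger one warning line and still be passed on to the plugin.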

@hakre
Author

hakre commented May 7, 2013

As in some countries, naming the directories by their own languages is quite common among developers.

The character-encoding inside the selector-patterns is currently undefined. If you would place bytes in there that are out of that range, it's unclear what is meant by that. My first suggestion here is to specify codepoint with some Unicode notation like \uXXXX.
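
To make the suggestion concrete, here is a small sketch of how a parser could expand such escapes in a selector. The name `unescape` is hypothetical, and the sketch only covers four-digit escapes for single code points; surrogate pairs and other escape forms are not handled.

```python
import re

# Matches a backslash, the letter u, and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape(selector):
    """Replace each \\uXXXX escape in a selector with the code point it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), selector)
```

With this, a selector can stay pure US-ASCII in the file while still naming a non-ASCII file name after expansion.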

Apart from that, you should also discuss the general problem when users are using such filenames in projects where they aim to exchange code (and therefore use an .editorconfig file). So it can be part of a standard that you say what you want to promote (like encoding of files which looks pretty limited in my eyes and I now see that it's limited for you, too).

I'm not saying that this is the solution, if you see my own formulation, I've pretty much excluded the patterns so far because I think there are some more things related that deserve a good answer with it. Encoding is one part of it. Btw. what did you say so far about the encoding of the patterns? How does it relate to the encoding of the file itself and how to the encoding used in the file-system to name the files?

We currently have different meanings for options before any sections. They are global options.

I don't understand, in my case it's global options as well. All I do is that I specifically define the meaning of [] (which I think should be done anyway in the specs and I personally would prefer to not write reserved for future use in there). I must not place it in there, but I found it more consistent. It might be worth to further outline and say that that part is global options and that those can only be in the first section (no overwriting by other selectors). I take a note of that. TODO

For the "invalid" properties, I would rather say they are "not defined yet".

This is actually what they are, invalid only means to issue a warning (of some kind), like you suggest with stderr. I only wrote lightly about errors so far in the text, I think this is getting more clear when I show some code. We then need to go through and discuss each warning point anyway. I differ merely between two types: Those errors which bring the parser to halt so that it is a parse error. The file could not be parsed. And those errors which don't halt everything and allow the parser to continue, like a missing closing square bracket on a selector line, or the said invalid (undefined) property name or property value.
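
A minimal sketch of the two error types, to make the distinction concrete. Everything here is assumed for illustration: the names `ParseError` and `parse`, the warning texts, and in particular the choice of a NUL byte as the fatal condition, which this memo does not define.

```python
class ParseError(Exception):
    """Fatal error: the file could not be parsed at all."""


def parse(text):
    """Return (preamble, sections, warnings) for the given file content."""
    if "\x00" in text:
        # Assumed example of a condition that halts the parser.
        raise ParseError("NUL byte in input")
    preamble, sections, warnings = {}, [], []
    target = preamble  # pairs before the first selector go here
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line[0] in ";#":
            continue
        if line.startswith("["):
            if not line.endswith("]"):
                # Recoverable: note it and carry on as if it were closed.
                warnings.append((number, "missing closing bracket"))
                line += "]"
            target = {}
            sections.append((line[1:-1], target))
        elif "=" in line:
            key, _, value = line.partition("=")
            target[key.strip().lower()] = value.strip()
        else:
            warnings.append((number, "unrecognized line"))
    return preamble, sections, warnings
```

The caller decides what to do with the collected warnings (print them, show them in an editor, ignore them), while a `ParseError` means there is no result to work with.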

Such as, jEdit plugin supports jedit_charset to support the very large set of charsets that jEdit supports. For projects that all people using jEdit, this is a good deal.

Which raises question-marks on my end. Why shouldn't we make this part of the specs and discuss what good suggestions are for editors if they don't support a charset? IANA gives us a good repository of charset names and if the editor does not support it, the property could be treated as if it does not exist (fallback mode, could be useful for other properties as well). Then you don't need to introduce such shadow property for each editor-plugin editor "specific" value even.

Also this somehow reminds to discuss on how to extend anyway. How can I place my XML config blob in there for the code-style of my Eclipse for example? Should this be made more open or more inter-operable? I thought you actually aim for interop from the grounds, which means simplification and comes with some prices. But I think it's worth at that point. But sure you can only start one experiment :)

My suggestion for going on with the specification is the following: Let's first focus on the problems that lie within the selector patterns. I would for the moment stick to bytestream normalization just to keep the requirements low. It's not set in stone, but it keeps things easy for the moment. But imagine if we manage it that way, we have found a way that is US-ASCII compatible which would be perfect for internet exchange across all borders. And what do we lose? I think nothing. If this is too burdensome, we can then say in the specs that UTF-8 is supported and all parsers need to support it (like XML does). We could also still keep the requirements to drop the lower control codes and normalize line-separators. If you look closely you can see that editorconfig can be easily enclosed into an XML envelope (already with the specs right now).

@treyhunner
Member

The character-encoding inside the selector-patterns is currently undefined. If you would place bytes in there that are out of that range, it's unclear what is meant by that.

I'm not sure how it's unclear. I think all characters up until a ] that comes before a # or ; should be considered part of the filename glob.

I can't imagine a good example for this so I'll use a contrived one with a filename borrowed from gitlabhq/gitlabhq#1776:

[中文.txt]
indent_style = space
indent_size = 4

That should match a specific filename, not any .txt file nor a file named .txt. We could require the user to escape those characters, but I don't see why that's necessary.

I should note that I just tried making a .editorconfig file with that glob and the Python editorconfig core did not see a file I made with a matching name. I'm not sure why that isn't supported currently.

I was confused about your statements about the top-level properties (root = true) because I thought you were equating them to other section properties. We just need to make it clear that those properties are special and at the current time are not sent to plugins that call the core libraries.

[*.js]
root = true

gives root = true in the output.

root = false
[*.js]

gives no output.

We have discussed in the past the ability for the EditorConfig file format to be extended for specialized cases. This could be useful for programs that parse EditorConfig files and want to specify some custom options for a specific use-case without requiring a new file and file format be specified. For example @jedhunsaker's recent mailing list post shows some extensions of the file format specific to JavaScript and used by a forked version of the codepainter tool.

I don't entirely understand the charset issue in jEdit but from the discussion that we had about it previously I believe the problem was that jEdit supports a variety of character sets that are more specific than the ones we provide and convey additional information about the file. This setting is jEdit-specific and I'm not sure what it's useful for, but I believe it was a compromise allowing further customization for jEdit that isn't available in other editors.

@hakre
Author

hakre commented May 7, 2013

@treyhunner: it is unclear because is it the encoding in the file? is it the encoding on disk? is it both? none of them? Which Unicode flavor do you use? Which Unicode normalization? How does the pattern engine match Unicode? Does it? ... . I do not have any answers to these questions and my saying is: let us see what practically is issued. The ticket you linked has been closed by the owner, last comment suggests it was bogus. A real life example would be good actually, but hopefully nobody is that crazy to configure files with such esoteric names in a shared project on the internet. At least not if the software is important and intended to share (and that is the domain where .editorconfig is applicable IMHO). And just rest assured that I as well can make up cases theoretically where issues arise. I just decided not to for the moment and went into bytestream normalization. I know it is pretty strict, but apart from esoteric examples which can be covered with some unicode-escape notation, even that strict, it doesn't destroy anything.

About jedit I think as well this is a compromise, however why make compromises when the specs are still open? We can not say which encodings are available in other editors or not. It also should not matter. What should matter is that the .editorconfig file says what the intended encoding for that file-type is. If the encoding is not available, then the editor can only violate it, sure, however not telling the encoding is defeating .editorconfig. This probably was a bit too strict thought. But I don't know if there was any discussion about it, I hopefully don't summon any demons here.

Also as jedit is an editor I can imagine that discussing this can have positive influence on the encoding property. See the spaces_in_parens you point to, I would say, this is the kind of feedback that is good to know about. These are users that will adopt if it works for them. But on the other hand I would take care that the core of the format actually works for them, otherwise there is not much use in .editorconfig (if everybody is gardening its own extensions, there is not much use in sharing this if you understand what I mean). Just my 2 cents, it's also late and this is quickly written, sorry for that.

@xuhdev
Member

xuhdev commented May 8, 2013

As in some countries, naming the directories by their own languages
is quite common among developers.

The character-encoding inside the selector-patterns is currently
undefined. If you would place bytes in there that are out of that range,
it's unclear what is meant by that. My first suggestion here is to
specify codepoint with some Unicode notation like \uXXXX.

I can see your point here. And I know that in the C specification, only
'\uXXXX' should be used for the characters "out of range". But, in the
real world, I haven't seen anyone writing strings like '\uXXXX'
everywhere in his code.

But in our case, if we have a UTF-8 character out of range, we must
ensure that the disk file names are also encoded in UTF-8 to make it
work. Yes, that's true, but the solution I would like to give is, when
matching patterns, convert the encoding of file names in the filesystem
to UTF-8 if they are not encoded in UTF-8. I think there should be some
API to obtain the encoding of the current filesystem. But the C core
library does not handle this case well. For a temporary notice, we could
warn users that if you are using those UTF-8 characters larger than 127
in your editorconfig files, it may not work if your filesystem is not
encoded in UTF-8.
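
As a sketch of that idea (in Python rather than the C core, and with hypothetical helper names): decode the raw file name using the filesystem encoding before comparing it with a pattern that was read from a UTF-8 .editorconfig file. `sys.getfilesystemencoding()` is the Python call for obtaining that encoding; `fnmatchcase` is only a stand-in for the real glob matcher.

```python
import sys
from fnmatch import fnmatchcase


def decode_filename(raw, fs_encoding=None):
    """Decode a file name given as bytes into text.

    Falls back to the interpreter's filesystem encoding when none is given.
    """
    encoding = fs_encoding or sys.getfilesystemencoding()
    return raw.decode(encoding)


def matches(raw_name, glob, fs_encoding=None):
    """Compare a raw file name with a pattern after decoding the name."""
    return fnmatchcase(decode_filename(raw_name, fs_encoding), glob)
```

With this, a name stored on disk in GBK can still match a pattern written in a UTF-8 file, as long as the filesystem encoding is known.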

Apart from that, you should also discuss the general problem when users
are using such filenames in projects where they aim to exchange code
(and therefore use an .editorconfig file). So it can be part of a
standard that you say what you want to promote (like encoding of files
which looks pretty limited in my eyes and I now see that it's limited
for you, too).

I don't understand your point here.

I'm not saying that this is the solution, if you see my own formulation,
I've pretty much excluded the patterns so far because I think there are
some more things related that deserve a good answer with it. Encoding is
one part of it. Btw. what did you say so far about the encoding of the
patterns? How does it relate to the encoding of the file itself and how
to the encoding used in the file-system to name the files?

The patterns have the same encoding as the .editorconfig file itself. I
would say the C core code only handles UTF-8 filesystems well.

We currently have different meanings for options before any
sections. They are global options.

I don't understand, in my case it's global options as well. All I do is
that I specifically define the meaning of [] (which I think should be
done anyway in the specs and I personally would prefer to not write
reserved for future use in there). I must not place it in there, but I
found it more consistent. It might be worth to further outline and say
that that part is global options and that those can only be in the first
section (no overwriting by other selectors). I take a note of that. TODO

Properties before any sections are totally different from other
properties. My global options mean they are global options for the
parser. They are not set for any files. For example, if you have
indent_style = space before sections, it will be ignored, but not set
globally.

For the "invalid" properties, I would rather say they are "not
defined yet".

This is actually what they are, invalid only means to issue a warning
(of some kind), like you suggest with stderr. I only wrote lightly about
errors so far in the text, I think this is getting more clear when I
show some code. We then need to go through and discuss each warning
point anyway. I differ merely between two types: Those errors which
bring the parser to halt so that it is a parse error. The file could not
be parsed. And those errors which don't halt everything and allow the
parser to continue, like a missing closing square bracket on a selector
line, or the said invalid (undefined) property name or property value.

Warnings should be acceptable as long as users can disable them.

Such as, jEdit plugin supports jedit_charset to support the very
large set of charsets that jEdit supports. For projects that all
people using jEdit, this is a good deal.

Which raises question-marks on my end. Why shouldn't we make this part
of the specs and discuss what good suggestions are for editors if they
don't support a charset? IANA gives us a good repository of charset
names and if the editor does not support it, the property could be
treated as if it does not exist (fallback mode, could be useful for
other properties as well). Then you don't need to introduce such shadow
property for each editor-plugin editor "specific" value even.

Not every editor follows IANA. For example, according to IANA, you can
use "cp936" or "ms936" to refer to the same charset, but editors are
supporting them differently. Allowing editors to extend them should be
fine when all developers use the same editor in the project. Also, any
properties extended by editors should start with <editor_name>_, so
it's very clear which is defined by editors, which is defined in the
specification. For reasonable ones, we definitely would like to include
in the specification.

Also this somehow reminds to discuss on how to extend anyway. How can I
place my XML config blob in there for the code-style of my Eclipse for
example? Should this be made more open or more inter-operable? I thought
you actually aim for interop from the grounds, which means
simplification and comes with some prices. But I think it's worth at
that point. But sure you can only start one experiment :)

Sorry, I don't understand what you mean here. Do you mean we need a
tool to convert Eclipse XML, which defines coding styles, into an
.editorconfig file?

@hakre
Author

hakre commented May 8, 2013

@treyhunner: I could now read the whole thread in the mailinglist about the property you point to and also to the related code-style for javascript that this hints. I think it is a good use-case if plugin-authors want to support such a standard, that they also can use .editorconfig for it. This is a very fine example. Perhaps like in CSS where there are "vendor-specific" prefixes, this could be an option as well here. Such usergroups can create meta-data properties they need and if within the .editorconfig one finds out this is not only with a single-standard and/or editor-feature (imagine somebody writes a plugin to support idiomatic.js in an editor and wants to re-use .editorconfig to place meta-information there), those can be migrated to "official .editorconfig" ones. Why not? This could make it lightweight for others to experiment with their own ideas and allow to better discuss things because you can get more practical easily.

@xuhdev: Yes those encoding problems need some thought. I also had an idea similar to your suggestion in mind to solve the mapping of filename encoding and pattern encoding. If the encoding of the .editorconfig is always UTF-8 (or the US-ASCII subset), this is less a burden (and most likely the perfect outcome). However, I'm not yet sure how to get there safely. Another problem that needs handling is: http://en.wikipedia.org/wiki/Unicode_equivalence
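
The equivalence problem in a nutshell, as a small sketch: the same visible name can be stored as different code point sequences, so a plain comparison fails unless both sides are brought to one normalization form first. `unicodedata.normalize` is the standard Python call; `normalized_match` is a hypothetical helper, NFC is an arbitrary choice for this example, and `fnmatchcase` stands in for the real pattern matcher.

```python
import unicodedata
from fnmatch import fnmatchcase


def normalized_match(name, glob, form="NFC"):
    """Match after normalizing both the file name and the pattern."""
    return fnmatchcase(
        unicodedata.normalize(form, name),
        unicodedata.normalize(form, glob),
    )


# "é" written as "e" plus a combining acute accent (decomposed) ...
DECOMPOSED = "cafe\u0301.txt"
# ... and as a single precomposed code point.
PRECOMPOSED = "caf\u00e9.txt"
```

Without normalization, `fnmatchcase(DECOMPOSED, PRECOMPOSED)` is false even though both strings render identically; `normalized_match` treats them as the same name.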

And yes I know that my suggestion to use \uXXXX is burdensome for some folks. Also this needs definition that it is UTF-16 and so on. Like in C you can find the same I think in Java and JSON, therefore I'd say, it's quite acceptable nevertheless. But as written bytestream-normalization is just a tool here to keep requirements distinct in the beginning. I'm not totally happy and sure it would be great if UTF-8 can be easily supported well in all plugins from the beginning. I did start my first implementation with UTF-8 and it's perfectly feasible, the byte-stream idea is highly driven by some pragmatic considerations. You might be able to wrap your mind about it as well, maybe this is more visible if you take it as something interim only. In the end I'm also not happy with bytestream normalization at a certain point because it's somehow like a code-smell to have it. This is more a feeling of mine. Let's say I'm not totally happy with it either that one needs to do that. But probably it's still sane as long everything editorconfig can be expressed. I like the idea to not need to care about encoding much at this point. And to have explicit support for US-ASCII.

In reality I think we could also make UTF-8 support a must. Like XML does. Then parsers just need to treat it properly, the comparison functions for the pattern need to support one of the Unicode normalization forms and the plugins need to take care to provide the file-data in UTF-8 as well as the file-names to the comparison function. This is perfectly possible, I don't have an issue with this per-se. It's just much more work to properly define this. And is it worth it? Are those the use-cases that we see often? Can't we defer until support is needed? Or do you think it is wrong to defer this?

And I dunno how much work the definition of the patterns itself already is. So it would be nice for the progress to just say, okay let's stick to bytestream normalization for the moment (nothing set in stone) and continue with the specification (if everybody is okay with that for a moment, I know and feel it's not perfect).

From the feedback so far I think error handling / warning raises the most questions next which I can imagine because this is also dependent on usage. For charset generally I think it is worth making clear which meta-data .editorconfig wants to provide. I have the feeling that the current discussion about that format was a bit in the wrong direction. This is why I criticized the current usage. However in the tokenizer you can see that I put in not the IANA registry but the editorconfig registry of charsets. I only wrote that I think this is wrong and I think this needs discussion at a certain point but if you ask me, not now. I think filename selector patterns are most important right now and probably relaxing the definition of property names so that we don't need to discuss each and every property name. This should be open for changes anyway, so that it would be counter-productive if the parser would not be able to follow a certain pattern for property names.

Next to new properties/vendor properties the XML chunk example given was meant in the sense of: Why think small? Not an XML converter (I think that would be plugin material and would not have much to do with the specification) but shouldn't it be possible to put even more meta-information in there while we're at it (not only single-line values of limited length and with reduced charset)? But that's nothing to decide from one day to the other as far as I am concerned, I just was throwing that in to have it named once. I mean we raise so many theoretical issues while doing this, so I see it in that tradition to put some thoughts on the table when they pop into mind.

So I think I'll formulate more the properties file and laxer tokenization to allow individual/vendor properties, make more distinct in the docs how global options work (in difference to the pattern/selector sections) as this has raised quite some questions and will try to start defining the pattern language used inside the selectors.

As I see it right now, the topic about warnings and errors needs some love. But I'm not yet sure how to continue with that. One part is to make suggestions for plugin authors which are relatively high-level. Some plugin authors also might not want that. Therefore this first should say which warnings / errors a parser provides and when (and are there classes and are there offsets/line/column numbers given etc.), what an implementer then does with this should be deferred to a later stage. It's just to say first of all which warnings / notices a parser could be able to generate. In the end a lint-application will deal differently with this than a plugin in an editor or some server-side process. Therefore it does not make much sense to rack our brains over this until it's visible which warnings we can provide at all which depends on how to define the parsing (why not having a tool suite providing a lint command?). So this probably gets better traction when a default implementation is discussed as that can document warnings. It's probably also useful to not define this at all because others might want to implement a parser differently and therefore do different error-handling, so as I think everybody can see, in that part there is a lot to discuss on different levels. Maybe like with the bytestream just put it a little behind. A parser just works or not in the most basic sense, therefore for a first proof of concept implementation this shouldn't be even necessary.

@treyhunner
Member

I'm glad you see the use case for allowing community-driven extension of the format. I also see this as similar to the recommendations for creating new HTTP headers and new CSS properties. Non-standard properties will become standard as it makes sense.

I still don't understand bytestream normalization, but I'm planning to drop that issue for now so I don't get hung up on that while discussing your other points.

I definitely see the warnings/errors as a suggestion for plugin authors and not a mandate. It would be convenient if there were a feature in the EditorConfig libraries (or related libraries) to state recommended warnings/errors based on a given .editorconfig file. That could make implementation easier for plugin authors.

I think this would work similar to a linter tool. I'm imagining that in Sublime Text 2 or Vim for example, when a user is editing a .editorconfig file warnings/errors would show up in the gutter to the left or would highlight the problem line and hovering over the line would display the warning/error messages.

@JoeRobich

@treyhunner,
Since the file format documentation has been updated to say

EditorConfig files use an INI format that is compatible with the format used by Python ConfigParser Library

and the Python ConfigParser supports multi-line values

key: line 1
    line 2
    line 3

Can I use multi-line values for a custom property and expect it to be supported?
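
For reference, this is how Python's `configparser` reads such a value when the continuation lines are indented deeper than the key. The sketch only demonstrates the Python library's behaviour; the section name `*` and the property name `key` are placeholders, and nothing here implies that other EditorConfig cores accept the same input.

```python
import configparser

# Continuation lines must be indented relative to the key.
SAMPLE = """\
[*]
key: line 1
    line 2
    line 3
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# The continuation lines are joined with newlines into one value.
value = parser["*"]["key"]
lines = value.splitlines()
```

Here `lines` ends up as three separate strings, one per physical line of the value.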

@jednano
Member

jednano commented Jan 6, 2020

Short answer: please don’t.

I don’t know why there is more than one specification in the wild, but it was my impression that the single source of truth should be https://editorconfig-specification.readthedocs.io/en/latest/

There, that specific call-out to the ConfigParser library was removed. Makes sense too, because the other cores are not written in Python.

@JoeRobich

@jedmao Interesting how the readthedocs.io specification differs from https://editorconfig.org/#file-format-details . It is unfortunate if it isn't supported because it would be quite useful for some settings.

@jednano
Member

jednano commented Jan 6, 2020

The ultimate source of truth is the set of tests which pass for all cores. If you can’t find multiline support in there (and I think you won’t), then it’s not officially supported.

@jednano
Member

jednano commented Jan 6, 2020

I’m also advocating a single source of truth for the specification, if we can consolidate.

@JoeRobich

If you can’t find multiline support in there (and I think you won’t), then it’s not officially supported.

Thanks @jedmao. Since all the core properties are simple values there would be no need for multi-line support or tests. However for custom properties there are lots of cases where multi-line would be beneficial. Snippets, Templates, File Headers, etc...

Let me also voice support for a single unambiguous specification.

@jednano
Member

jednano commented Jan 6, 2020

FWIW, I was aiming to decouple just the INI parsing aspect of EC into a Rust implementation (for wasm). See https://github.com/jedmao/editorconfig-ini

@cxw42
Member

cxw42 commented Aug 11, 2021

Now that we have a formal specification (https://editorconfig-specification.readthedocs.io/), can this issue be closed?

@xuhdev
Member

xuhdev commented Aug 11, 2021

Yes I think so. Closing

@xuhdev xuhdev closed this as completed Aug 11, 2021
@JoeRobich

Maybe the specification on editorconfig.org (https://editorconfig.org/#file-format-details) should be changed to point to the formal specification.
