Skip to content
This repository has been archived by the owner on Mar 3, 2023. It is now read-only.

Prioritize file type matching over first line matching in grammar scoring system #9513

Merged
merged 9 commits into from
Nov 12, 2015
Merged

Prioritize file type matching over first line matching in grammar scoring system #9513

merged 9 commits into from
Nov 12, 2015

Conversation

Ingramz
Copy link
Contributor

@Ingramz Ingramz commented Nov 10, 2015

Fixes #2734

This is a pull request version of #2734 (comment) in order to work together on solving the currently unfair grammar scoring system. The main problem with the current system is that it gives a higher score to HTML grammar for any file that begins with HTML doctype, but the file might be using PHP or some other (templating) language that utilizes HTML syntax (with a different file type, eg php).

@nathansobo suggested adding an option to fall back to using the old system in case the new one breaks something, but I am not sure how to do that exactly. Also another question would be whether this transition period is needed at all, because all it "breaks" right now is grammar selection for files, for which there is a known file type, but the first line is matched from some other grammar. A fix for that case would be just adding the file type via user configuration, to override the known type from the other language.

@nathansobo
Copy link
Contributor

@Ingramz Thanks. I buy your argument that the existing override structure is enough. Let me review this...

@nathansobo
Copy link
Contributor

@Ingramz If you look at Travis, It looks like some tests related to the change code are failing in core specs. It looks like we're asserting exactly the opposite of the new behavior, so you'll need to update that. Also, you should add a test that asserts that the path-based selection wins over the first line regexp.

bestMatch

# Extended: Returns a {Number} representing how well the grammar matches the
# `filePath` and `contents`.
getGrammarScore: (grammar, filePath, contents) ->
return Infinity if @grammarOverrideForPath(filePath) is grammar.scopeName
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good.

@nathansobo
Copy link
Contributor

You had commented with a long description of this logic elsewhere. Could you transfer that into the body of this PR?

@Ingramz
Copy link
Contributor Author

Ingramz commented Nov 10, 2015

From the other post.

This will return Infinity as a score for the overridden grammar, which in my opinion is a little more elegant than 2 + length of file path.

The path scoring system is already good as-is. Longer match will get a higher score, plus 0.5 bias for user defined file types. I find it very clever that the user can select between two grammars by adding the file type in question to the user defined list, even if it already was defined in grammar so one of the two grammars always gets priority with the same matched longest file type.

Also there is a bias of 0.25 so that a non-bundle package would win over a bundled package, if their matches were the same length (ergo the file types matched are the same), this is the same as the code that was on lines 39/40. However one thing to note - if we get -1 as a file type score for all grammars and there is a non-bundle grammar, then that would win over all bundled grammars with a score of -0.75, it seems that this oddity exists also in original code. I think that this bias shouldn't be applied if the path score itself is -1. <- update: this has been fixed in PR

Finally a bias for the case where the first line of the file matches a file, which is currently the weakest - 0.125. This should help in case the two grammars were equal in the previous cases. Setting this with the smallest bias makes more sense to me as it should be used only as a last resort, which works really well for files which the type is unknown.

@nathansobo
Copy link
Contributor

Cool, thanks for the explanation. @maxbrunsfeld do you think you could take a look at this too? Maybe we could screen share and just try to think through the implications of changing this and be sure I'm not missing anything.

@Ingramz
Copy link
Contributor Author

Ingramz commented Nov 10, 2015

The numbers come from the sequence 1/pow(2, n), for which the sum for n=1...infinity is 1, so if we keep going smaller, we hopefully never reach 1 (so let's make sure we're not summing all of the possible terms in that sequence). This way the small increments never override the file type's priority. This also applies recursively, sum for n=2..infinity is 1/2 etc, so all increments smaller than 0.5 will also stay smaller than 0.5, which helps with the priority.

@nathansobo
Copy link
Contributor

Cool deal about the numbers. Very nice thought process. Now just I need to see tests of new functionality and a green build.

@Ingramz
Copy link
Contributor Author

Ingramz commented Nov 11, 2015

@nathansobo seems like the tests are passing now.

@nathansobo
Copy link
Contributor

Okay. I'm a bit nervous that I'm not thinking through some implication but this is why we have a beta phase. Merging. Thanks @Ingramz for the thoughtful implementation here. Awesome work.

nathansobo pushed a commit that referenced this pull request Nov 12, 2015
Prioritize file type matching over first line matching in grammar scoring system
@nathansobo nathansobo merged commit df0671c into atom:master Nov 12, 2015
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PHP and HTML Syntax Coloring issue
4 participants