Disambiguate Slice and JSON languages for .ice #4376
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Original sample [block-ledmatrix-64x64.ice](https://github.com/scanlime/icebreaker-icestudio-ledmatrix/blob/27815aa2406eedaff34a86cfee3d9ea1cade30a4/block-ledmatrix-64x64.ice#L1), [Unlicense](https://github.com/scanlime/icebreaker-icestudio-ledmatrix/blob/27815aa2406eedaff34a86cfee3d9ea1cade30a4/LICENSE) Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Original sample [testSlice01.ice](https://github.com/paulkoerbitz/language-slice/blob/113b2e2c0f4b8413f72bc9567976f87ae797460e/test/testSlice01.ice#L1), [BSD 3-Clause](https://github.com/paulkoerbitz/language-slice/blob/113b2e2c0f4b8413f72bc9567976f87ae797460e/LICENSE) Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Ouch. That JSON sample is huge. Any chance it could be shortened? While we appreciate large, detailed samples, stuff like this is normally overkill for Linguist, and adds undue weight/processing time to the classifier. Other than that, this LGTM. 👍
Sure, but so far replacing current block-ledmatrix-64x64.ice (3891 sloc) 118 KB by any of the
fails the Will try to find some sweet spot.
lib/linguist/heuristics.yml
Outdated
@@ -158,7 +158,7 @@ disambiguations:
- language: Slice
  pattern: '^\s*(#\s*(include|if[n]def|pragma)|module\s+[A-Za-z][_A-Za-z0-9]*)'
Would it be invalid syntax for a Slice file to start with `{`?
If so, we can replace this heuristic with:
- language: Slice
- pattern: '^\s*(#\s*(include|if[n]def|pragma)|module\s+[A-Za-z][_A-Za-z0-9]*)'
+ pattern: '\A(?!\s*[{\[])'
I suspect the tests are failing because Slice files which don't contain lines like `#include…` are being left to the classifier.
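As a quick check of the suggested pattern, here is a minimal Ruby sketch. The two input strings are made up for illustration (they are not the samples in this PR), and `NOT_JSON_START` is a hypothetical constant name.

```ruby
# The pattern suggested above: match only when the content does NOT
# begin (after optional whitespace) with `{` or `[`.
NOT_JSON_START = /\A(?!\s*[{\[])/

# Made-up inputs for illustration, not the real samples.
json_like  = "{\n  \"version\": \"1.2\"\n}\n"
slice_like = "module Demo\n{\n  interface Hello { void say(); };\n};\n"

puts NOT_JSON_START.match?(json_like)   # false
puts NOT_JSON_START.match?(slice_like)  # true
```

Note that the Slice-like input has no `#include` or `#pragma` line, which is the case the narrower pattern would have left to the classifier.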
A "proper" Slice file should never start with a `{`. I think it might be possible to contrive an example, but no one should write it like that.
That should be enough then. 👍
Updated in f09b6ea
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
b213793 to 361ce64
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
361ce64 to d8de0cb
I replaced the 120kb JSON sample for .ice with an 8kb one in 361ce64, but the Classifier test on CI fails the same way. I then added the smallest sample I could find that passes the Classifier tests, the 96kb block-ledmatrix-generic.ice (from the same repo, under the same license), in d8de0cb.
Thanks for fixing this @bzz!
Could you open a new issue to discuss the issue you found with the tests? (Copy and paste is fine.)
lib/linguist/heuristics.yml
Outdated
@@ -156,9 +156,9 @@ disambiguations:
 - extensions: ['.ice']
   rules:
   - language: Slice
-    pattern: '^\s*(#\s*(include|if[n]def|pragma)|module\s+[A-Za-z][_A-Za-z0-9]*)'
+    pattern: '\A(?!\s*[{\[])'
Isn't this the exact opposite of the JSON regex? Do we need both?
We should be able to get by with only `'\A(?!\s*[{\[])'` for Slice. That, or remove the pattern altogether and use an open-ended fallback to Slice with JSON checked first, which is probably clearer in the long run. I'm still puzzled as to why the tests are failing. 😕
Also, I can't even see the diff I'm reviewing. 😀 Scrollbar isn't going away:
> remove the pattern altogether and use an open-ended fallback to Slice with JSON checked first

Addressed in 9c2302a.
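The "JSON checked first, open-ended fallback to Slice" idea can be sketched as an ordered rule list in which a rule without a pattern always applies. This is a simplified model, not Linguist's actual implementation: `Rule`, `ICE_RULES` and `disambiguate` are hypothetical names, and the JSON pattern is an assumption inferred from the remark that the Slice pattern is its exact opposite.

```ruby
# Hypothetical, simplified model of ordered heuristic rules.
Rule = Struct.new(:language, :pattern)

ICE_RULES = [
  Rule.new("JSON", /\A\s*[{\[]/), # assumed JSON pattern
  Rule.new("Slice", nil),         # no pattern: open-ended fallback
].freeze

# Return the language of the first rule that applies to the content.
def disambiguate(data, rules = ICE_RULES)
  rule = rules.find { |r| r.pattern.nil? || r.pattern.match?(data) }
  rule&.language
end

puts disambiguate("{ \"design\": {} }")          # JSON
puts disambiguate("#pragma once\nmodule M {};")  # Slice
```

With this ordering only one pattern has to be maintained, since everything that is not recognised as JSON falls through to Slice.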
> I'm still puzzled as to why the tests are failing.

I've put back the smallest JSON sample for .ice in bd3b268, and one can see that it's not the heuristics but only a Classifier test that is failing: `TestClassifier#test_classify_ambiguous_languages`.
This Classifier test seems to be designed to look for all the languages that are ambiguous by extension (JSON and Slice in the `.ice` case) and then assume that `Classifier.classify` can distinguish them without the heuristics strategy that we are adjusting in this PR.
This does not seem possible based only on the samples that we have for this pair of languages: 8kb for Slice (which also uses `{}` and `[]`) vs. 216kb total for JSON:
JSON = -600.095 + -4.551 = -604.646
Slice = -565.507 + -6.949 = -572.456
Instead of adding a bigger JSON sample, I tried adding +16kb of Slice samples, and it seems to almost work.
24kb of Slice vs. 216kb for JSON:
JSON = -600.095 + -4.552 = -604.647
Slice = -596.006 + -6.544 = -602.551
+12kb of Slice samples (total +32kb vs. +96kb in the previous approach) seems to do the trick:
JSON = -600.095 + -4.552 = -604.647
Slice = -624.850 + -6.257 = -631.108
🎉
Now I just need to find those 32kb of Slice that are not under GPL and push them here :)
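To make the three runs above easier to compare, here is a small Ruby sketch that sums the two terms and picks the winner. It assumes the figures are log-scores (a token term plus a language prior term) for a JSON sample, so the higher, less negative total wins; `RUNS` and `winner` are hypothetical names.

```ruby
# [token term, prior term] per language, copied from the three runs above.
RUNS = {
  "8kb Slice"   => { "JSON" => [-600.095, -4.551], "Slice" => [-565.507, -6.949] },
  "24kb Slice"  => { "JSON" => [-600.095, -4.552], "Slice" => [-596.006, -6.544] },
  "+12kb Slice" => { "JSON" => [-600.095, -4.552], "Slice" => [-624.850, -6.257] },
}.freeze

# The language with the highest (least negative) total wins.
def winner(scores)
  scores.max_by { |_language, (tokens, prior)| tokens + prior }.first
end

RUNS.each { |label, scores| puts "#{label}: #{winner(scores)}" }
```

Under that reading, Slice wins the first two runs and JSON only wins the third, which matches "almost work" followed by "seems to do the trick".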
Good work! Wait, so, the heuristics strategy doesn't always run??
It does not in `TestClassifier`, as this test directly uses only the low-level Bayesian classifier API, `Classifier.classify()`.
I believe the assumption of this test may be too strong: for some reason it assumes that all colliding extensions can be disambiguated by the Bayesian classifier directly.
In Linguist, with multiple strategies, this classifier should only need to disambiguate the cases that are not already covered by heuristics.
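A toy model of that multi-strategy flow might look like the following. All names here are hypothetical and the real Linguist code differs; the point is only that detection stops as soon as a strategy narrows the candidates to one, so a later classifier stage never sees cases the heuristics already settle.

```ruby
# Toy model of a multi-strategy pipeline; every name is hypothetical.
by_extension = ->(file, _candidates) { file[:name].end_with?(".ice") ? %w[JSON Slice] : [] }

by_heuristic = lambda do |file, candidates|
  return candidates unless candidates.sort == %w[JSON Slice]
  file[:data].match?(/\A\s*[{\[]/) ? ["JSON"] : ["Slice"]
end

# Stand-in for the Bayesian classifier: just takes the first candidate.
by_classifier = ->(_file, candidates) { candidates.first(1) }

STRATEGIES = [by_extension, by_heuristic, by_classifier].freeze

# Run strategies in order; stop as soon as exactly one candidate is left.
def detect(file, strategies)
  candidates = []
  strategies.each do |strategy|
    result = strategy.call(file, candidates)
    return result.first if result.size == 1
    candidates = result unless result.empty?
  end
  candidates.first
end

puts detect({ name: "block.ice", data: "{ \"version\": \"1.2\" }" }, STRATEGIES) # JSON
puts detect({ name: "Murmur.ice", data: "module Murmur {};" }, STRATEGIES)        # Slice
```

In both calls the heuristic stage returns a single language, so the classifier stand-in is never reached. Calling the classifier directly, as the test does, skips that narrowing.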
Sorry for the delay - just got back from vacation and will open an issue/try the suggestion asap.
@pchaigno Think we should add a test to guard against heuristics with patterns that look like RegExp literals? Stuff like this is easy to overlook, since we're used to seeing RegExp patterns as Ruby/Perl/JavaScript literals:
- '/foo/'
- '/bar/i'

Patterns which actually need to match stuff enclosed by forward slashes can always use character classes:
- '[/]foo/'
- '/foo/[i]'

The fact I completely overlooked this mistake in review suggests it might happen again, so it wouldn't hurt to add a test for it. (Have I mentioned that YAML is the worst format in the world, and that Ruby is more suited to being a configuration language than anything programming-related? 😁)
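One possible shape for such a guard, as a minimal sketch: flag any pattern string that starts with a slash and ends with a slash plus optional flag letters. `LITERAL_LOOKALIKE` and `looks_like_regexp_literal?` are hypothetical names, and this is not an existing Linguist test.

```ruby
# Hypothetical check: flag patterns written like /foo/ or /bar/i.
LITERAL_LOOKALIKE = %r{\A/.*/[a-z]*\z}m

def looks_like_regexp_literal?(pattern)
  LITERAL_LOOKALIKE.match?(pattern)
end

['/foo/', '/bar/i', '[/]foo/', '/foo/[i]'].each do |pattern|
  puts "#{pattern}: #{looks_like_regexp_literal?(pattern)}"
end
```

The first two strings are flagged and the two character-class forms are not, so the workaround described above would still pass the check.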
Shouldn't this be covered by the tests for the new heuristic rules though? Maybe we should check that there are no empty fixtures in
That'd be even better, actually. 👍
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
That's it. I'm cloning your branch locally and investigating this myself. This isn't right.
I suspect a bug in the Bayesian classifier... Good luck! 😜
Yeah, erh, I think I'll leave that up to you to figure out. 😅 I tested the regexp in
Origin: https://github.com/mumble-voip/mumble/blob/e31d267a11b4ed0597ad41309a7f6b715837141f/src/murmur/Murmur.ice License: BSD-style https://github.com/mumble-voip/mumble/blob/e31d267a11b4ed0597ad41309a7f6b715837141f/LICENSE Signed-off-by: Alexander Bezzubov <bzz@apache.org>
This is a fix for the Slice vs JSON language disambiguation heuristic. It was preliminarily discussed at #4243 (comment).
Before: `.ice` files are detected as the Slice language. But as noted in Add Slice language (.ice) #4243 (comment), many are just JSON, as `*.ice` is used by IceStudio.
After:
Description
The fix includes multiple moving parts:
- `heuristics.yml` disambiguating "Slice" and "JSON".
- `.ice` extension to JSON in `languages.yml`. Without this, the above heuristics will never kick in, as the Extension strategy will always have only 1 match.
- `.ice` sample in JSON, so an 8kb block-sync-counter8.ice was added, under The Unlicense (it took quite some time to find a non-GPL IceStudio example).

Why was this not noticed in the original #4243?
Moved to #4391
Checklist:
.ico