Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disambiguate Slice and JSON languages for .ice #4376

Merged
merged 8 commits into from
Jan 31, 2019

Conversation

bzz
Copy link
Contributor

@bzz bzz commented Jan 10, 2019

This a fix for Slice vs JSON language disambiguation heuristic. It was preliminary discussed at #4243 (comment)

Before:

After:

  • some .ice are Slice
  • others, IceStudio ones, are JSON

Description

Fix includes multiple moving parts:

  • A fix Regexp syntax in heuristics.yml disambiguating "Slice" and "JSON".
  • Add .ice extension to JSON in languages.yml.
    Without this, the above heuristics will never kick in, as Extension strategy will always have only 1 match.
  • That requires to have at least one .ice sample in JSON, so a 8kb block-sync-counter8.ice was added, The Unlicense (took quite some time to find a non-GPL IceStudio example).
  • That requires to have more samples of actual Slice, as otherwise Classifier would not be able to distinguish samples, so a 40kb Murmur.ice, BSD 3-Clause was added.

Why this was not noticed in original #4243?
Moved to #4391

Checklist:

  • I am fixing a misclassified language Slice VS JSON for .ico

@bzz bzz mentioned this pull request Jan 10, 2019
5 tasks
@bzz bzz changed the title Disambiguate Slice and JSON languages for .ico Disambiguate Slice and JSON languages for .ice Jan 11, 2019
@Alhadis
Copy link
Collaborator

Alhadis commented Jan 11, 2019

Ouch. That JSON sample is huge. Any chance it could be shortened?

While we appreciate large, detailed samples, stuff like this is normally overkill for Linguist, and adds undue weight/processing time to the classifier.

Other than that, this LGTM. 👍

@bzz
Copy link
Contributor Author

bzz commented Jan 11, 2019

That JSON sample is huge. Any chance it could be shortened?

Sure, but so far replacing current block-ledmatrix-64x64.ice (3891 sloc) 118 KB by any of the

fails the TestClassifier#test_classify_ambiguous_languages tests as it can not distinguish JSON vs Slice.

Will try to find some sweet spot.

@bzz
Copy link
Contributor Author

bzz commented Jan 11, 2019

@Alhadis I have updated the comment above.

The smallest JSON sample could find that passes the tests is this (3104 sloc) 93.9 KB sample. It's 25kb smaller then the current one.

Please advice on how to proceed.

@@ -158,7 +158,7 @@ disambiguations:
- language: Slice
pattern: '^\s*(#\s*(include|if[n]def|pragma)|module\s+[A-Za-z][_A-Za-z0-9]*)'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be invalid syntax for a Slice file to start with {?

If so, we can replace this heuristic with:

   - language: Slice
-    pattern: '^\s*(#\s*(include|if[n]def|pragma)|module\s+[A-Za-z][_A-Za-z0-9]*)'
+    pattern: '\A(?!\s*[{\[])'

I suspect the reason why the tests are failing is because Slice files which don't contain lines like #include… are being left to the classifier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A "proper" Slice file should never start with a {. I think it might be possible to contrive an example, but no one should write it like that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be enough then. 👍

Copy link
Contributor Author

@bzz bzz Jan 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in f09b6ea

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
@bzz bzz force-pushed the fix/json-slice-disambiguation branch from b213793 to 361ce64 Compare January 12, 2019 21:49
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
@bzz bzz force-pushed the fix/json-slice-disambiguation branch from 361ce64 to d8de0cb Compare January 12, 2019 21:56
@bzz
Copy link
Contributor Author

bzz commented Jan 12, 2019

Have replaced 120kb JSON sample for .ice with 8kb one in 361ce64 - Classifier test on CI fails the same way.

Added the smallest sample I could find to be passing Classifier tests, 96kb block-ledmatrix-generic.ice (from same repo under the same license) in d8de0cb

Copy link
Contributor

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this @bzz!

Could you open a new issue to discuss the issue you found with the tests? (Copy and paste is fine.)

@@ -156,9 +156,9 @@ disambiguations:
- extensions: ['.ice']
rules:
- language: Slice
pattern: '^\s*(#\s*(include|if[n]def|pragma)|module\s+[A-Za-z][_A-Za-z0-9]*)'
pattern: '\A(?!\s*[{\[])'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the exact opposite of the JSON regex? Do we need both?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to get by with only '\A(?!\s*[{\[])' for Slice. That, or remove the pattern altogether and use an open-ended fallback to Slice with JSON checked first. Which is probably clearer in the long-run. I'm still puzzled as to why the tests are failing. 😕

Also, I can't even see the diff I'm reviewing. 😀 Scrollbar isn't going away:
a

Copy link
Contributor Author

@bzz bzz Jan 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Alhadis

remove the pattern altogether and use an open-ended fallback to Slice with JSON checked first

addressed in 9c2302a

I'm still puzzled as to why the tests are failing.

I've put back smallest JSON sample for .ice in bd3b268 and one can see that it's not heuristics but only a Classifier test that is failing TestClassifier#test_classify_ambiguous_languages.

This Classifier test seems to be designed to look for all the languages that are ambiguous by extension (JSON and Slice in .ice case) and then assume Classifier.classify would be able to distinguish those without using heuristics strategy that we are adjusting in this PR.

This seems to be not possible, only based on samples that we have for this pair of languages

8kb for Slice (that also uses {} and []) VS 216kb total for JSON:

      JSON =   -600.095 +  -4.551 =   -604.646
     Slice =   -565.507 +  -6.949 =   -572.456

instead of adding bigger JSON sample, I tried adding +16kb of Slice one and it seems to almost work

24kb Slice VS 216kb for JSON:

      JSON =   -600.095 +  -4.552 =   -604.647
     Slice =   -596.006 +  -6.544 =   -602.551

Copy link
Contributor Author

@bzz bzz Jan 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+12kb of Slice samples (total +32kb VS +96kb in previous approach) seems to do the trick

      JSON =   -600.095 +  -4.552 =   -604.647
     Slice =   -624.850 +  -6.257 =   -631.108

🎉

now I just need to find those 32kb of Slice not under GPL and push them in here :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! Wait, so, the heuristics strategy doesn't always run??

Copy link
Contributor Author

@bzz bzz Jan 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not in TestClassifier as this test directly uses only a low-level Bayesian classifier API Classifier.classify()

I belive it may be the assumptions of this test that is too strong - for some reason it assumes that all colliding extensions can be disambiguated by Bayesian classifier directly.
When in Linguist \w multiple strategies this classifier should be able to disambiguate only the cases that are not covered by heuristics already.

@bzz
Copy link
Contributor Author

bzz commented Jan 24, 2019

Sorry for delay - just got back from vacation and will open an issue/try the suggestion asap.

@Alhadis
Copy link
Collaborator

Alhadis commented Jan 24, 2019

@pchaigno Think we should add a test to guard against heuristics with patterns that look like RegExp literals? Stuff like this is easy to overlook, since we're used to seeing RegExp patterns as Ruby/Perl/JavaScript literals:

- '/foo/'
- '/bar/i'

Patterns which actually need to match stuff enclosed by forward slashes can always use character classes:

- '[/]foo/'
- '/foo/[i]'

The fact I completely overlooked this mistake in review suggests it might happen again, so it wouldn't hurt to add a test for it.

(Have I mentioned that YAML is the worst format in the world, and that Ruby is more suited to being a configuration language than anything programming-related? 😁 </snark>).

@pchaigno
Copy link
Contributor

Think we should add a test to guard against heuristics with patterns that look like RegExp literals

Shouldn't this be covered by the tests for the new heuristic rules though? Maybe we should check that there are no empty fixture in test_heuristics.rb instead?

I belive that happened because despite the advised test that supposed to be checking disambiguation by heuristics was added, it was failing silently due to empty fixture in test_heuristics.rb#L194. It was empty because there were no samples of .ice in JSON.

@Alhadis
Copy link
Collaborator

Alhadis commented Jan 24, 2019

That'd be even better, actually. 👍

@bzz
Copy link
Contributor Author

bzz commented Jan 24, 2019

Could you open a new issue to discuss the issue you found with the tests? (Copy and paste is fine.)

@pchaigno done in #4391

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
@Alhadis
Copy link
Collaborator

Alhadis commented Jan 24, 2019

What the hell:

  1) Failure:
TestClassifier#test_classify_ambiguous_languages [/home/travis/build/github/linguist/test/test_classifier.rb:56]:
/home/travis/build/github/linguist/samples/JSON/block-sync-counter8.ice
[["Slice", -572.4646846236486], ["JSON", -604.6475163156096]].
Expected: "JSON"
  Actual: "Slice"

That's it. I'm cloning your branch locally and investigating this myself. This isn't right.

@pchaigno
Copy link
Contributor

That's it. I'm cloning your branch locally and investigating this myself. This isn't right.

I suspect a bug in the Bayesian classifier... Good luck! 😜

@Alhadis
Copy link
Collaborator

Alhadis commented Jan 24, 2019

Yeah, erh, I think I'll leave that up to you to figure out. 😅 I tested the regexp in irb to make sure it DID match the patterns it's given to match, so it couldn't possibly be related to the heuristics themselves... 😢

Copy link
Collaborator

@Alhadis Alhadis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huzzah, the tests are passing! Well done, @bzz.

@lildude, @pchaigno LGTM. 👍

@lildude lildude merged commit b1ed2c0 into github-linguist:master Jan 31, 2019
@bzz bzz deleted the fix/json-slice-disambiguation branch February 5, 2019 09:09
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants